Abstract



We introduce a new discrepancy score between two distributions that gives an indi- 
cation on their similarity. While much research has been done to determine if two 
samples come from exactly the same distribution, much less research considered 
the problem of determining if two finite samples come from similar distributions. 
The new score gives an intuitive interpretation of similarity; it optimally perturbs 
the distributions so that they best fit each other. The score is defined between 
distributions, and can be efficiently estimated from samples. We provide conver- 
gence bounds of the estimated score, and develop hypothesis testing procedures 
that test if two data sets come from similar distributions. The statistical power of
these procedures is presented in simulations. We also compare the score's capacity
to detect similarity with that of other known measures on real data. 



1 Introduction 

The question of similarity between two sets of examples is common to many fields, including statis- 
tics, data mining, machine learning and computer vision. For example, in machine learning, a 
standard assumption is that the training and test data are generated from the same distribution. How- 
ever, in some scenarios, such as Domain Adaptation (DA), this is not the case and the distributions 
are only assumed similar. It is quite intuitive to denote when two inputs are similar in nature, yet the 
following question remains open: given two sets of examples, how do we test whether or not they 
were generated by similar distributions? The main focus of this work is providing a similarity score 
and a corresponding statistical procedure that gives one possible answer to this question. 

Discrepancy between distributions has been studied for decades, and a wide variety of distance 
scores have been proposed. However, not all proposed scores can be used for testing similarity. 
The main difficulty is that most scores have not been designed for statistical testing of similarity 
but equality, known as the Two-Sample Problem (TSP). Formally, let $P$ and $Q$ be the generating
distributions of the data; the TSP tests the null hypothesis $H_0 : P = Q$ against the general alternative
$H_1 : P \neq Q$. This is one of the classical problems in statistics. However, sometimes, as in DA,
the interesting question concerns similarity rather than equality. By design, most equality
tests cannot readily be transformed to test similarity; see Section 3 for a review of representative works.

In this work, we quantify similarity using a new score, the Perturbed Variation (PV). We propose 
that similarity is related to some predefined value of permitted variations. Consider the gait of two 
male subjects as an example. If their physical characteristics are similar, we expect their walk to 
be similar, and thus assume the examples representing the two are from similar distributions. This 
intuition applies when the distribution of our measurements only endures small changes for people 
with similar characteristics. Put more generally, similarity depends on what "small changes" are in 
a given application, and implies that similarity is domain specific. The PV, as hinted by its name, 
measures the discrepancy between two distributions while allowing for some perturbation of each 
distribution; that is, it allows small differences between the distributions. What accounts for small 
differences is a parameter of the PV, and may be defined by the user with regard to a specific domain.
Figure 1: X and O identify samples from two distributions; dotted circles denote allowed perturbations.
Samples marked in red are matched with neighbors, while the unmatched samples indicate the PV discrepancy.



Figure 1 illustrates the PV. Note that, like perceptual similarity, the PV turns a blind eye to variations
of some rate.



2 The Perturbed Variation 

The PV on continuous distributions is defined as follows: 

Definition 1. Let $P$ and $Q$ be two distributions on a Banach space $\mathcal{X}$, and let $M(P,Q)$ be the set
of all joint distributions on $\mathcal{X} \times \mathcal{X}$ with marginals $P$ and $Q$. The PV, with respect to a distance
function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $\epsilon$, is defined by

$$PV(P,Q,\epsilon,d) \doteq \inf_{\mu \in M(P,Q)} \mathbb{P}_{\mu}[d(X,Y) > \epsilon], \qquad (1)$$

over all pairs $(X,Y) \sim \mu$, such that the marginal of $X$ is $P$ and the marginal of $Y$ is $Q$.

Put into words, Equation (1) defines the joint distribution $\mu$ that couples the two distributions such
that the probability of a pair $(X,Y) \sim \mu$ being separated by a distance greater than $\epsilon$ is
minimized.

The solution to (1) is a special case of the classical mass transport problem of Monge [1] and its
version by Kantorovich: $\inf_{\mu \in M(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} c(x,y)\, d\mu(x,y)$, where $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a measurable
cost function. When $c$ is a metric, the problem describes the 1st Wasserstein metric. Problem (1)
may be rephrased as the optimal mass transport problem with the cost function $c(x,y) = \mathbb{1}_{[d(x,y) > \epsilon]}$,
and may be rewritten as $\inf_{\mu} \int\!\!\int \mathbb{1}_{[d(x,y) > \epsilon]}\, \mu(y|x)\, dy\, P(x)\, dx$. The probability $\mu(y|x)$ defines the
transportation plan of $x$ to $y$. The PV optimal transportation plan is obtained by perturbing the mass
of each point $x$ in its $\epsilon$ neighborhood so that it redistributes to the distribution of $Q$. These small
perturbations do not add any cost, while transportation of mass to further areas is equally costly.
Note that when $P = Q$ the PV is zero, as the optimal plan is simply the identity mapping. Due to
its cost function, the PV is not a metric: it is symmetric, but it does not satisfy the triangle
inequality and may be zero for distributions $P \neq Q$. Despite this limitation, this cost function fully
quantifies the intuition that small variations should not be penalized when similarity is considered.
In this sense, similarity is not unique by definition, as more than one distribution can be similar to a
reference distribution.

The PV is also closely related to the Total Variation distance (TV), which may be written, using a
coupling characterization, as $TV(P,Q) = \inf_{\mu \in M(P,Q)} \mathbb{P}_{\mu}[X \neq Y]$ [2]. This formulation implies
that any transportation plan, even to a close neighbor, is costly. Due to this property, the TV is
known to be an overly sensitive measure that overestimates the distance between distributions. For
example, consider two distributions defined by the Dirac delta functions $\delta(a)$ and $\delta(a + \epsilon)$. For any
$\epsilon > 0$, the TV between the two distributions is 1, while they are intuitively similar. The PV resolves this
problem by adding perturbations, and therefore is a natural extension of the TV. Notice, however,
that the $\epsilon$ used to compute the PV need not be infinitesimal, and is defined by the user.
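As a worked instance of the Dirac example, assume the ground distance is $d(x,y) = |x - y|$ and write $\epsilon'$ for the shift between the two point masses (a symbol introduced here only to keep it apart from the PV parameter $\epsilon$). The only coupling of $\delta(a)$ and $\delta(a + \epsilon')$ puts all mass on the pair $(a, a + \epsilon')$, so both quantities can be read off directly:

```latex
\begin{align*}
TV\big(\delta(a), \delta(a+\epsilon')\big)
  &= \mathbb{P}[X \neq Y] = 1 \quad \text{for every } \epsilon' > 0, \\
PV\big(\delta(a), \delta(a+\epsilon'), \epsilon, d\big)
  &= \mathbb{1}\big[\, |a - (a+\epsilon')| > \epsilon \,\big]
   = \begin{cases} 0, & \epsilon' \le \epsilon, \\ 1, & \epsilon' > \epsilon. \end{cases}
\end{align*}
```

The PV therefore agrees with the TV only once the shift exceeds the permitted perturbation.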

The PV can be seen as a compromise between the Wasserstein distance and the TV. As explained, it
relaxes the sensitivity of the TV; however, it does not "over optimize" the transportation plan. Specifically,
distances larger than the allowed perturbation are discarded. This aspect also contributes to
the efficiency of estimation of the PV from samples; see Section 2.2.
Figure 2.1: Illustration of the PV score between discrete distributions.



2.1 The Perturbed Variation on Discrete Distributions 

It can be shown that for two discrete distributions Problem (1) is equivalent to the following problem.

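Definition 2. Let $\mu_1$ and $\mu_2$ be two discrete distributions on the unified support $\{a_1, \ldots, a_N\}$. Define
the neighborhood of $a_i$ as $ng(a_i, \epsilon) = \{z \,;\, d(z, a_i) \le \epsilon\}$. The $PV(\mu_1, \mu_2, \epsilon, d)$ between the two
distributions is:

$$\min_{w_i \ge 0,\; v_j \ge 0,\; Z_{ij} \ge 0} \;\; \frac{1}{2} \sum_{i=1}^{N} w_i + \frac{1}{2} \sum_{j=1}^{N} v_j \qquad (2)$$

$$\text{s.t.} \quad \sum_{a_j \in ng(a_i, \epsilon)} Z_{ij} + w_i = \mu_1(a_i), \quad \forall i$$

$$\sum_{a_i \in ng(a_j, \epsilon)} Z_{ij} + v_j = \mu_2(a_j), \quad \forall j$$

$$Z_{ij} = 0, \quad \forall (i,j) \notin ng(a_i, \epsilon).$$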

Each row in the matrix $Z \in \mathbb{R}^{N \times N}$ corresponds to a point mass in $\mu_1$, and each column to a point
mass in $\mu_2$. For each $i$, $Z(i,:)$ is zero in columns corresponding to non-neighboring elements, and
non-zero only for columns $j$ for which transportation between $\mu_2(a_j)$ and $\mu_1(a_i)$ is performed. The
discrepancies between the distributions are depicted by the scalars $w_i$ and $v_j$ that count the "leftover"
mass in $\mu_1(a_i)$ and $\mu_2(a_j)$. The objective is to minimize these discrepancies; therefore matrix $Z$
describes the optimal transportation plan constrained to $\epsilon$-perturbations. An example of an optimal
plan is presented in Figure 2.1.
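To make Definition 2 concrete, the following is a minimal sketch rather than the authors' implementation. It relies on an observation that is ours: summing the equality constraints gives $\sum_i w_i = \sum_j v_j = 1 - \sum_{ij} Z_{ij}$, so the objective equals one minus the mass moved between neighbors, and the linear program reduces to a bipartite maximum-flow problem. The function name `discrete_pv`, the default distance, and the use of exact `Fraction` arithmetic are assumptions made for illustration.

```python
from collections import deque
from fractions import Fraction


def discrete_pv(support, mu1, mu2, eps, dist=lambda a, b: abs(a - b)):
    """PV between two discrete distributions on a shared support.

    Sketch: the objective of the LP equals 1 - sum(Z), so we maximize the
    mass transported between eps-neighbors with a simple max-flow routine.
    """
    n = len(support)
    src, snk = 2 * n, 2 * n + 1
    # Nodes 0..n-1 are the atoms of mu1, nodes n..2n-1 the atoms of mu2.
    cap = [dict() for _ in range(2 * n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, Fraction(0)) + c
        cap[v].setdefault(u, Fraction(0))

    for i in range(n):
        add_edge(src, i, Fraction(mu1[i]))
        add_edge(n + i, snk, Fraction(mu2[i]))
        for j in range(n):
            if dist(support[i], support[j]) <= eps:
                # Capacity 1 never binds because the total mass is 1.
                add_edge(i, n + j, Fraction(1))

    flow = Fraction(0)
    while True:
        # Breadth-first search for an augmenting path (Edmonds-Karp).
        parent = {src: None}
        queue = deque([src])
        while queue and snk not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if snk not in parent:
            break
        # Collect the path, find its bottleneck, and push flow along it.
        path, v = [], snk
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push
    return 1 - flow


half = Fraction(1, 2)
# mu1 sits on atoms {0, 1}, mu2 on atoms {2, 3}; with eps = 1 only 1 -> 2 is allowed.
print(discrete_pv([0, 1, 2, 3], [half, half, 0, 0], [0, 0, half, half], eps=1))  # 1/2
```

With `eps=0` only identical atoms are neighbors and the value coincides with the TV; enlarging `eps` can only decrease it.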



2.2 Estimation of the Perturbed Variation 

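Typically, we are given samples from which we would like to estimate the PV. Given two samples
$S_1 = \{x_1, \ldots, x_n\}$ and $S_2 = \{y_1, \ldots, y_m\}$, generated by distributions $P$ and $Q$ respectively,
the sample estimate $\widehat{PV}(S_1, S_2, \epsilon, d)$ is:

$$\min_{w_i \ge 0,\; v_j \ge 0,\; Z_{ij} \ge 0} \;\; \frac{1}{2n} \sum_{i=1}^{n} w_i + \frac{1}{2m} \sum_{j=1}^{m} v_j \qquad (3)$$

$$\text{s.t.} \quad \sum_{y_j \in ng(x_i, \epsilon)} Z_{ij} + w_i = 1, \qquad \sum_{x_i \in ng(y_j, \epsilon)} Z_{ij} + v_j = 1, \qquad \forall i, j$$

where $Z \in \mathbb{R}^{n \times m}$. When $n = m$, the optimization in (3) is identical to (2), as in this case the
samples define a discrete distribution. However, when $n \neq m$, Problem (3) also accounts for the
difference in the size of the two samples.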

Problem (3) is a linear program with constraints that may be written as a totally unimodular matrix.
It follows that one of the optimal solutions of (3) is integral [3]; that is, the mass of each sample
is transferred as a whole. This solution may be found by solving the optimal assignment on an
appropriate bipartite graph [3]. Let $G = (V = (A, B), E)$ define this graph, with $A = \{x_i, w_i \,;\, i = 1, \ldots, n\}$
and $B = \{y_j, v_j \,;\, j = 1, \ldots, m\}$ as its bipartite partition. The vertices $x_i \in A$ are linked
Algorithm 1 Compute $\widehat{PV}(S_1, S_2, \epsilon, d)$

Input: $S_1 = \{x_1, \ldots, x_n\}$ and $S_2 = \{y_1, \ldots, y_m\}$, $\epsilon$ rate, and distance measure $d$.

1. Define $G = (V = (A, B), E)$: $A = \{x_i \in S_1\}$, $B = \{y_j \in S_2\}$.
Connect an edge $e_{ij} \in E$ if $d(x_i, y_j) \le \epsilon$.

2. Compute the maximum matching on $G$.

3. Define $S_w$ and $S_v$ as the number of unmatched vertices in sets $S_1$ and $S_2$ respectively.
Output: $\widehat{PV}(S_1, S_2, \epsilon, d) = \frac{1}{2}\left(\frac{S_w}{n} + \frac{S_v}{m}\right)$.



with edge weight zero to $y_j \in ng(x_i)$ and with weight $\infty$ to $y_j \notin ng(x_i)$. In addition, every vertex
$x_i$ ($y_j$) is linked with weight 1 to $w_i$ ($v_j$). To make the graph complete, assign zero cost edges
between all vertices $x_i$ and $w_k$ for $k \neq i$ (and vertices $y_j$ and $v_k$ for $k \neq j$).

We note that the Earth Mover Distance (EMD) [4], a sampled version of the transportation problem,
is also formulated by a linear program that may be solved by optimal assignment. For the EMD and
other typical assignment problems, the computational complexity is more demanding; for example,
using the Hungarian algorithm it has an $O(N^3)$ complexity, where $N = n + m$ is the number of
vertices [5]. In contrast, graph $G$, which describes the PV, is a simple bipartite graph for which maximum
cardinality matching, a much simpler problem, can be applied to find the optimal assignment. To
find the optimal assignment, first solve the maximum matching on the partial graph between vertices
$x_i$, $y_j$ that have zero weight edges (corresponding to neighboring vertices). Then, assign vertices $x_i$
and $y_j$ for which a match was not found to $w_i$ and $v_j$ respectively; see Algorithm 1 and Figure 1
for an illustration of a matching. It is easy to see that the solution obtained solves the assignment
problem associated with the PV.

The complexity of Algorithm 1 amounts to the complexity of the maximum matching step and of
setting up the graph, i.e., an additional $O(nm)$ complexity of computing distances between all points.
Let $k$ be the average number of neighbors of a sample; then the average number of edges in the
bipartite graph $G$ is $|E| = n \times k$. The maximum cardinality matching of this graph is obtained in
$O(kn\sqrt{n + m})$ steps in the worst case [5].
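The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the authors' code: the function name `pv_estimate` is ours, and for brevity the matching step uses the simple augmenting-path (Kuhn) algorithm, which runs in $O(|V||E|)$ time, rather than the faster method behind the $O(kn\sqrt{n+m})$ bound. The recursion depth grows with the sample size, so very large samples would need an iterative variant.

```python
def pv_estimate(s1, s2, eps, dist):
    """Sample PV of two samples via maximum bipartite matching."""
    n, m = len(s1), len(s2)
    # Step 1: connect x_i and y_j whenever d(x_i, y_j) <= eps.
    adj = [[j for j in range(m) if dist(s1[i], s2[j]) <= eps] for i in range(n)]

    match_of_y = [-1] * m  # match_of_y[j] = index i matched to y_j, or -1

    def try_augment(i, seen):
        # Depth-first search for an augmenting path that starts at x_i.
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_of_y[j] == -1 or try_augment(match_of_y[j], seen):
                match_of_y[j] = i
                return True
        return False

    # Step 2: maximum matching; each successful augmentation adds one pair.
    matched = sum(try_augment(i, set()) for i in range(n))

    # Step 3: unmatched samples on each side carry the discrepancy.
    unmatched_x, unmatched_y = n - matched, m - matched
    return 0.5 * (unmatched_x / n + unmatched_y / m)


# Two of the three points on each side have a neighbor within eps = 0.1.
abs_dist = lambda a, b: abs(a - b)
print(pv_estimate([0.0, 0.5, 1.0], [0.05, 0.55, 3.0], 0.1, abs_dist))
```

Any distance function can be passed as `dist`; for vectors in $\mathbb{R}^d$ one could supply the Euclidean or the $\|\cdot\|_\infty$ distance.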

3 Related Work 

Many scores have been defined for testing discrepancy between distributions. We focus on represen- 
tative works for nonparametric tests that are most related to our work. First, we consider statistics for 
the Two Sample Problem (TSP), i.e., equality testing, that are based on the asymptotic distribution of 
the statistic conditioned on the equality. Among these tests is the well known Kolmogorov-Smirnov 
test (for one dimensional distributions), and its generalization to higher dimensions by minimal 
spanning trees [6]. A different statistic is defined by the portion of k-nearest neighbors of each sample
that belongs to different distributions; larger portions mean the distributions are closer [7]. These
scores are well known in the statistical literature but cannot be easily changed to test similarity, as 
their analysis relies on testing equality. 

As discussed earlier, the 1st Wasserstein metric and the TV metric have some relation to the PV.
The EMD and histogram based $L_1$ distance are the sample based estimates of these metrics respectively.
In both cases, the distance is not estimated directly on the samples, but on a higher level
partition of the space: histogram bins or signatures (cluster centers). As a result, these estimators 
have inaccuracies. Contrarily, the PV is estimated directly on the samples and converges to its value 
between the underlying continuous distributions. We note that after a good choice of signatures, the 
EMD captures perceptual similarity, similar to that of the PV. However, due to the abstraction to 
signatures, the EMD does not converge to the Wasserstein metric between the continuous distribu- 
tions, and therefore is commonly used to rate distances and not for statistical testing. It is possible 
to consider the PV as a refinement of the EMD notion of similarity; instead of clustering the data 
to signatures and moving the signatures, it perturbs each sample. In this manner it captures a finer 
notion of the perceptual similarity. 
(a) PV($\epsilon = 0.1$) = 0    (b) PV($\epsilon = 0.1$) = 0    (c) PV($\epsilon = 0.1$) = 1

Figure 2: Two distributions on $\mathbb{R}$: The PV captures the perceptual similarity of (a), (b) against the dissimilarity
in (c). The $L_1 = 1$ on $I_1 = \{(0, 0.1), (0.1, 0.2), \ldots\}$ for all cases; on $I_2 = \{(0, 0.2), (0.2, 0.4), \ldots\}$ it is
$L_1(P_a, Q_a) = 0$, $L_1(P_b, Q_b) = 1$, $L_1(P_c, Q_c) = 1$; and on $I_3 = \{(0, 0.3), (0.3, 0.6), \ldots\}$ it is $L_1(P_a, Q_a) = 0$,
$L_1(P_b, Q_b) = 0$, $L_1(P_c, Q_c) = 0$.

The partition of the support to bins allows some relaxation of the TV notion. Therefore, instead 
of the TV, it may be interesting to consider the $L_1$ as a similarity distance on the measures after
discretization. The example in Figure 2 shows that this relaxation is quite rigid and that there is no
single partition that captures the perceptual similarity. In general, the problem would remain even
if bins with varying width were permitted. Namely, the problem is the choice of a single partition
to measure similarity of a reference distribution to multiple distributions, while choosing multiple
partitions would make the distances incomparable. Also note that defining a "good" partition is a
difficult task, which is exacerbated in higher dimensions.

The last group of statistics are scores established in machine learning: the $d_A$ distance presented by
Kifer et al., which is based on the maximum discrepancy on a chosen subset of the support [8], and the
Maximum Mean Discrepancy (MMD) by Gretton et al., which defines the discrepancy after embedding
the distributions in a Reproducing Kernel Hilbert Space (RKHS) [9]. These scores have corresponding
statistical tests for the TSP; however, since their analysis is based on finite convergence bounds,
in principle they may be modified to test similarity. The $d_A$ captures some intuitive notion of
similarity; however, to our knowledge, it is not known how to compute it for a general subset class.¹
The MMD captures the distance between the samples in some RKHS. While this distance perfectly
defines an equality test, it is not clear if it translates to a well defined similarity test. As an example,
consider testing if the MMD is greater than some value larger than zero using the RBF kernel. To do
so, the parameter $\sigma$ must be chosen in advance. Clearly, the result of the test is highly dependent on
this choice, but it is not clear how it should be made. In contrast, the PV's parameter $\epsilon$ is related to
the data's input domain and may be chosen accordingly.



4 Analysis 

We present sample rate convergence analysis of the PV. The proofs of the theorems are provided in 
the supplementary material. When no clarity is lost, we omit d from the notation. Our main theorem 
is stated as follows: 

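Theorem 3. Suppose we are given two i.i.d. samples $S_1 = \{x_1, \ldots, x_n\} \in \mathbb{R}^d$ and $S_2 =
\{y_1, \ldots, y_m\} \in \mathbb{R}^d$ generated by distributions $P$ and $Q$, respectively. Let the ground distance be
$d = \|\cdot\|_\infty$ and let $\mathcal{N}(\epsilon)$ be the cardinality of a disjoint cover of the distributions' support. Then,
for any $\delta \in (0, 1)$, $N = \min(n, m)$, and $\eta = \sqrt{\frac{2\left(\log(2(2^{\mathcal{N}(\epsilon)} - 2)) + \log(1/\delta)\right)}{N}}$, we have that

$$\mathbb{P}\left( \left| \widehat{PV}(S_1, S_2, \epsilon) - PV(P, Q, \epsilon) \right| \le \eta \right) \ge 1 - \delta.$$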

The theorem is defined using $\|\cdot\|_\infty$, but can be rewritten for other metrics (with a slight change of
constants). The proof of the theorem exploits the form of the optimization Problem (3). We use the
bound of Theorem 3 to construct hypothesis tests. A weakness of this bound is its strong dependency
on the dimension. Specifically, it is dependent on $\mathcal{N}(\epsilon)$, which for $\|\cdot\|_\infty$ is $O((1/\epsilon)^d)$: the number of
disjoint boxes of volume $\epsilon^d$ that cover the support. Unfortunately, this convergence rate is inherent;
namely, without making any further assumptions on the distribution, this rate is unavoidable and is
an instance of the "curse of dimensionality". In the following theorem, we present a lower bound on
the convergence rate.
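The dependence on the dimension can be made tangible with a short numerical sketch. It assumes the deviation of Theorem 3 has the form $\eta = \sqrt{2(\log(2(2^{\mathcal{N}(\epsilon)} - 2)) + \log(1/\delta))/N}$ and that the data are scaled to $[0,1]^d$, so that $\mathcal{N}(\epsilon) = \lceil 1/\epsilon \rceil^d$; both the function name `pv_deviation_bound` and this choice of cover size are assumptions made for illustration.

```python
import math


def pv_deviation_bound(n, m, eps, dim, delta):
    """Deviation eta with |PV_hat - PV| <= eta w.p. >= 1 - delta (assumed form)."""
    big_n = min(n, m)
    cover = math.ceil(1.0 / eps) ** dim  # assumed N(eps) for data in [0, 1]^dim
    if cover < 2:
        raise ValueError("the cover must contain at least two cells")
    # log(2 * (2**cover - 2)) without forming 2**cover explicitly:
    # 2 * (2**cover - 2) = 2**(cover + 1) * (1 - 2**(1 - cover)).
    correction = math.log1p(-(2.0 ** (1 - cover))) if cover < 64 else 0.0
    log_term = (cover + 1) * math.log(2.0) + correction
    return math.sqrt(2.0 * (log_term + math.log(1.0 / delta)) / big_n)


for dim in (1, 2, 3, 4):
    eta = pv_deviation_bound(n=5000, m=5000, eps=0.1, dim=dim, delta=0.05)
    print(f"d={dim}: eta={eta:.3f}")
```

Under these assumptions the bound already exceeds 1, and is therefore vacuous for a score in $[0,1]$, at $d = 4$ with 5000 samples per set.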



¹Most work with the $d_A$ has been with the subset of characteristic functions, and approximated by the error
of a classifier.
Theorem 4. Let $P = Q$ be the uniform distribution on $S^{d-1}$, a unit $(d-1)$-dimensional hypersphere.
Let $S_1 = \{x_1, \ldots, x_N\} \sim P$ and $S_2 = \{y_1, \ldots, y_N\} \sim Q$ be two i.i.d. samples. For
any $\epsilon, \epsilon', \delta \in (0, 1)$, $0 \le \eta < 2/3$ and sample size $\frac{\log(1/\delta)}{2(1 - 3\eta/2)^2} \le N \le \frac{\eta}{2} e^{d(1 - \epsilon^2)/2}$, we have
$PV(P, Q, \epsilon') = 0$ and

$$\mathbb{P}\left(\widehat{PV}(S_1, S_2, \epsilon) > \eta\right) \ge 1 - \delta. \qquad (4)$$

For example, for $\delta = 0.01$, $\eta = 0.5$, for any $37 \le N \le 0.25\, e^{d(1 - \epsilon^2)/2}$ we have that $\widehat{PV} > 0.5$ with
probability at least 0.99. The theorem shows that, for this choice of distributions, for a sample size
that is smaller than $O(e^{d/2})$, there is a high probability that the value of $\widehat{PV}$ is far from $PV$.

It can be observed that the empirical estimate $\widehat{PV}$ is stable, that is, it is almost identical for two
data sets differing on one sample. Due to its stability, applying McDiarmid's inequality yields the
following.

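Theorem 5. Let $S_1 = \{x_1, \ldots, x_n\} \sim P$ and $S_2 = \{y_1, \ldots, y_m\} \sim Q$ be two i.i.d. samples. Let
$n \ge m$, then for any $\eta > 0$

$$\mathbb{P}\left( \left| \widehat{PV}(S_1, S_2, \epsilon) - \mathbb{E}[\widehat{PV}(n, m, \epsilon)] \right| \ge \eta \right) \le e^{-\eta^2 m^2 / 4n},$$

where $\mathbb{E}[\widehat{PV}(n, m, \epsilon)]$ is the expectation of $\widehat{PV}$ for a given sample size.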

This theorem shows that the sample estimate of the PV converges to its expectation without dependence
on the dimension. By combining this result with Theorem 3 it may be deduced that only the
convergence of the bias, the difference $|\mathbb{E}[\widehat{PV}(n, m, \epsilon)] - PV(P, Q, \epsilon)|$, may be exponential in the
dimension. This convergence is distribution dependent. However, intuitively, slow convergence is
not always the case, for example when the support of the distributions lies in a lower dimensional
manifold of the space. To remedy this dependency we propose a bootstrapping bias correcting technique,
presented in Section 5. A different possibility is to project the data to one dimension; due
to space limitations, this extension of the PV is left out of the scope of this paper and presented in
Appendix A.2 in the supplementary material.



5 Statistical Inference 



We construct two types of complementary procedures for hypothesis testing of similarity and
dissimilarity.² In the first type of procedures, given $0 \le \theta < 1$, we distinguish between the null
hypothesis $H_0^{(1)} : PV(P, Q, \epsilon, d) \le \theta$, which implies similarity, and the alternative hypothesis
$H_1^{(1)} : PV(P, Q, \epsilon, d) > \theta$. Notice that when $\theta = 0$, this test is a relaxed version of the TSP. Using
$PV(P, Q, \epsilon, d) = 0$ instead of $P = Q$ as the null allows for some distinction between the distributions,
which gives the needed relaxation to capture similarity. In the second type of procedures, we test
whether two distributions are similar. To do so, we flip the role of the null and the alternative. Note
that there isn't an equivalent of this form for the TSP; therefore we cannot infer similarity using
the TSP test, but only reject equality. Our hypothesis tests are based on the finite sample analysis
presented in Section 4; see Appendix A.1 in the supplementary material for the procedures.

To provide further inference on the PV, we apply bootstrapping for approximations of Confidence
Intervals (CI). The idea of bootstrapping for estimating CIs is based on a two step procedure:
approximation of the sampling distribution of the statistic by resampling with replacement from the
initial sample (the bootstrap stage), followed by a computation of the CI based on the resulting
distribution. We propose to estimate the CI by the Bootstrap Bias-Corrected accelerated (BCa) interval,
which adjusts the simple percentile method to correct for bias and skewness [10]. The BCa is known
for its high accuracy; particularly, it can be shown that the BCa interval converges to the theoretical
CI with rate $O(N^{-1})$, where $N$ is the sample size. Using the CI, a hypothesis test may be formed:
the null $H_0^{(1)}$ is rejected with significance $\alpha$ if the range $[0, \theta] \not\subset [\underline{CI}, \overline{CI}]$. Also, for the second test,
we apply the principle of CI inclusion [11], which states that if $[\underline{CI}, \overline{CI}] \subset [0, \theta]$, dissimilarity is
rejected and similarity deduced.
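The two-step bootstrap idea can be sketched as follows. Note that this sketch deliberately swaps in the simple percentile interval rather than the BCa interval proposed above, so it does not correct for the bias that resampling with replacement introduces into the matching; it is meant only to show the resample-then-read-off structure. For self-containment it uses a univariate estimator with $d(x,y) = |x - y|$, for which a greedy pass over the sorted samples yields a maximum matching; the names `pv_1d` and `bootstrap_percentile_ci` are ours.

```python
import random


def pv_1d(xs, ys, eps):
    """Sample PV for univariate data with d(x, y) = |x - y|."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = matched = 0
    while i < len(xs) and j < len(ys):
        if abs(xs[i] - ys[j]) <= eps:
            matched += 1  # the two smallest remaining points are neighbors
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1        # xs[i] is too far below every remaining y
        else:
            j += 1        # ys[j] is too far below every remaining x
    return 0.5 * ((len(xs) - matched) / len(xs) + (len(ys) - matched) / len(ys))


def bootstrap_percentile_ci(xs, ys, eps, alpha=0.05, n_boot=1000, seed=0):
    """Percentile bootstrap CI for the sample PV (a simpler stand-in for BCa)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Bootstrap stage: resample each set with replacement.
        bx = rng.choices(xs, k=len(xs))
        by = rng.choices(ys, k=len(ys))
        stats.append(pv_1d(bx, by, eps))
    stats.sort()
    # CI stage: read the alpha/2 and 1 - alpha/2 empirical quantiles.
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


xs = [k / 100 for k in range(100)]
ys = [k / 100 + 0.03 for k in range(100)]
print(bootstrap_percentile_ci(xs, ys, eps=0.1, n_boot=200))
```

The resulting interval would then be compared with $[0, \theta]$ as described above.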

²The two procedures are distinct, as, in general, lacking evidence to reject similarity is not sufficient to infer
dissimilarity, and vice versa.
Figure 3: (a) The Type-2 error for varying perturbation sizes and $\epsilon$ values. (b) Precision-Recall: Gait data. (c) Precision-Recall: Video clips.


6 Experiments 

6.1 Synthetic Simulations 

In our first experiment, we examine the effect of the choice of $\epsilon$ on the statistical power of the test.
For this purpose, we apply significance testing for similarity on two univariate uniform distributions:
$P \sim U[0, 1]$ and $Q \sim U[\Delta(\epsilon), 1 + \Delta(\epsilon)]$, where $\Delta(\epsilon)$ is a varying size of perturbation. We
considered values of $\epsilon \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$ and sample sizes up to 5000 samples from each
distribution. For each value $\epsilon'$, we test the null hypothesis $H_0^{(1)} : PV(P, Q, \epsilon') = 0$ for ten equally
spaced values of $\Delta(\epsilon')$ in the range $[0, 2\epsilon']$. In this manner, we test the ability of the PV to detect
similarity for different sizes of perturbations. The percentage of times the null hypothesis was falsely
rejected, i.e. the type-1 error, was kept at a significance level $\alpha = 0.05$. The percentage of times
the null hypothesis was correctly rejected, the power of the test, was estimated as a function of the
sample size and averaged over 500 repetitions. We repeated the simulation using the tests based on
the bounds as well as using BCa confidence intervals.
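The data-generating part of this setup can be reproduced with a small script. This is a sketch under our own assumptions (sample size, seed, and the function names `pv_1d` and `simulate` are illustrative), and it only reports the sample PV for each shift; it does not carry out the hypothesis tests or the 500 repetitions. Because the data are univariate, a greedy pass over the sorted samples suffices to find a maximum matching.

```python
import random


def pv_1d(xs, ys, eps):
    """Sample PV for univariate data with d(x, y) = |x - y|."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = matched = 0
    while i < len(xs) and j < len(ys):
        if abs(xs[i] - ys[j]) <= eps:
            matched += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return 0.5 * ((len(xs) - matched) / len(xs) + (len(ys) - matched) / len(ys))


def simulate(eps, shifts, n=2000, seed=0):
    """Return (shift, PV_hat) pairs for P = U[0, 1] and Q = U[shift, 1 + shift]."""
    rng = random.Random(seed)
    results = []
    for shift in shifts:
        p = [rng.random() for _ in range(n)]
        q = [shift + rng.random() for _ in range(n)]
        results.append((shift, pv_1d(p, q, eps)))
    return results


eps = 0.1
shifts = [2 * eps * k / 9 for k in range(10)]  # ten equally spaced values in [0, 2*eps]
for shift, pv in simulate(eps, shifts):
    print(f"shift={shift:.3f}  PV_hat={pv:.3f}")
```

For this pair of uniform distributions one would expect the estimate to stay near zero while the shift is below $\epsilon$ and to grow roughly with the excess of the shift over $\epsilon$ beyond that point.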



The results in Figure 3(a) show the type-2 error of the bound based simulations. As expected,
the power of the test increases as the sample size grows. Also, when finer perturbations need to be
detected, more samples are needed to gain statistical power. For the BCa CI we obtained type-1
and type-2 errors smaller than 0.05 for all the sample sizes. This shows that the convergence of the
estimated PV to its value is clearly faster than the bounds. Note that, given a sufficient sample size,
any statistic for the TSP would have rejected similarity for any $\Delta > 0$.



6.2 Comparing Distance Measures 

Next, we test the ability of the PV to measure similarity on real data. To this end, we test the ranking
performance of the PV score against other known distributional distances. We compare the PV to
the multivariate extension of the Wald-Wolfowitz score of Friedman & Rafsky (FR) [6], Schilling's
nearest neighbors score (KNN) [7], and the Maximum Mean Discrepancy score of Gretton et al.
(MMD) [9].³ We rank similarity for the applications of video retrieval and gait recognition.

The ranking performance of the methods was measured by precision-recall curves, and the Mean
Average Precision (MAP). Let $r$ be the number of samples similar to a query sample. For each
$1 \le i \le r$ of these observations, define $r_i \in [1, T - 1]$ as its similarity rank, where $T$ is the total
number of observations. The Average Precision is $AP = \frac{1}{r} \sum_{i=1}^{r} i / r_i$, and the MAP is the average
of the AP over the queries. The tuning parameters for the methods ($k$ for the KNN, $\sigma$ for the MMD
with RBF kernel, and $\epsilon$ for the PV) were chosen by cross-validation. The Euclidean distance was
used in all methods.
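The ranking measures can be computed directly from the ranks of the similar items. The sketch below assumes the formula $AP = \frac{1}{r}\sum_{i=1}^{r} i / r_i$ with the ranks $r_i$ taken in increasing order; the function names are ours.

```python
def average_precision(relevant_ranks):
    """AP = (1/r) * sum_i i / r_i, with r_i the sorted ranks of the r similar items."""
    ranks = sorted(relevant_ranks)
    r = len(ranks)
    # The i-th similar item (1-based) found at rank r_i contributes precision i / r_i.
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / r


def mean_average_precision(ranks_per_query):
    """MAP: the AP averaged over all queries."""
    return sum(average_precision(q) for q in ranks_per_query) / len(ranks_per_query)


# Three similar items retrieved at ranks 1, 3 and 6: (1/1 + 2/3 + 3/6) / 3.
print(average_precision([1, 3, 6]))
print(mean_average_precision([[1, 2, 3], [1, 3, 6]]))
```

A perfect ranking, in which the $r$ similar items occupy ranks $1, \ldots, r$, gives $AP = 1$.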

In our first experiment, we tested ranking for video-clip retrieval. The data we used was collected
and generated by [12], and includes 1,083 videos of commercials, each of about 1,500 frames (25
fps). Twenty unique videos were selected as query videos, each of which has one similar clip in
³Note that the statistical tests of these measures test equality while the PV tests similarity and therefore our
experiments are not of statistical power but of ranking similarity. Even in the case of the distances that may be
transformed for similarity, like the MMD, there is no known function between the PV similarity to other forms
of similarity. As a result, there is no basis on which to compare which similarity test has better performance.
Table 1: MAP for Auslan, Video, and Gait data sets. Average MAP (± standard deviation) computed on a
random selection of 75% of the queries, repeated 100 times.

Data set   PV              KNN             MMD             FR
Video      0.758 ± 0.009   0.741 ± 0.014   0.689 ± 0.008   0.563 ± 0.019
Gait       0.792 ± 0.021   0.736 ± 0.014   0.722 ± 0.017   0.698 ± 0.017
Gait-F     0.844 ± 0.017   0.750 ± 0.015   0.729 ± 0.017   0.666 ± 0.016
Gait-M     0.679 ± 0.024   0.712 ± 0.017   0.716 ± 0.031   0.799 ± 0.016


the collection, to which 8 more similar clips were generated by different transformations: brightness
increased/decreased, saturation increased/decreased, borders cropped, logo inserted, randomly
dropped frames, and added noise frames. Lastly, each frame of a video was transformed to a 32-RGB
representation. We computed the similarity rate for each query video to all videos in the set,
and ranked the position of each video. The results show that the PV and the KNN score are invariant
to most of the transformations, and outperform the FR and MMD methods (Table 1 and Figure 3(c)).
We found that brightness changes were most problematic for the PV. For this type of distortion, the
simple RGB representation is not sufficient to capture the similarity.

We also tested gait similarity of female and male subjects; same gender samples are assumed similar.
We used gait data that was recorded by a mobile phone, available at [13]. The data consists of two
sets of 15 min walks of 20 individuals, 10 women and 10 men. As features we used the magnitude
of the triaxial accelerometer. We cut the raw data to intervals of approximately 0.5 secs, without
identification of gait cycles. In this manner, each walk is represented by a collection of about 1500
intervals. An initial scaling to [0,1] was performed once for the whole set. The comparison was
done by ranking by gender the 39 samples with respect to a reference walk.



The precision-recall curves in Figure 3(b) show that the PV is able to retrieve with higher precision
in the mid-recall range. For the early recall points the PV did not show optimal performance;
interestingly, we found that with a smaller $\epsilon$, the PV had better performance on early recall points. This
behavior reflects the flexibility of the PV: smaller $\epsilon$ should be chosen when the goal is to find very
similar instances, and larger when the goal is to find higher level similarity. The MAP results
presented in Table 1 show that the PV had better performance on the female subjects. From examination
of the subject information sheet we found that the range of weight and height within the female group
is 50-77 kg and 1.6-1.8 m, while within the male group it is 47-100 kg and 1.65-1.93 m; that is, there
is much more variability in the male group. This information provides a reasonable explanation for
the PV results, as it appears that a subject from the male group may have a gait that is as dissimilar
to the gait of a female subject as it is to a different male. In the female group the subjects are more
similar and therefore the precision is higher.



7 Discussion 



We proposed a new score that measures the similarity between two multivariate distributions, and 
assigns to it a value in the range [0,1]. The sensitivity of the score, reflected by the parameter e, 
allows for flexibility that is essential for quantifying the notion of similarity. The PV is efficiently 
estimated from samples. Its low computational complexity relies on its simple binary classification 
of points as neighbors or non-neighbor points, such that optimization of distances of faraway points 
is not needed. In this manner, the PV captures only the essential information to describe similarity. 
Although it is not a metric, our experiments show that it captures the distance between similar
distributions as well as well-known distributional distances do. Our work also includes convergence analysis
of the PV. Based on this analysis we provide hypothesis tests that give statistical significance to the
resulting score. While our bounds are dependent on the dimension, when the intrinsic dimension of
the data is smaller than the domain's dimension, statistical power can be gained by bootstrapping.
In addition, the PV has an intuitive interpretation that makes it an attractive score for a meaningful 
statistical testing of similarity. Lastly, an added value of the PV is that its computation also gives 
insight to the areas of discrepancy; namely, the areas of the unmatched samples. In future work we 
plan to further explore this information, which may be valuable on its own merits. 






References 

[1] G. Monge. Memoire sur la theorie des deblais et de remblais. Histoire de I'Academie Royale 
des Sciences de Paris, avec les Memoires de Mathematique et de Physique pour la meme annee, 
1781. 

[2] L. Riischendorf. Monge-kantorovich transportation problem and optimal couplings. Jahres- 
bericht der DMV, 3:113-137, 2007. 

[3] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons Inc, 1998. 

[4] Y. Rubner, C. Tomasi, and L.J. Guibas. A metric for distributions with applications to image 
databases. In Computer Vision, 1998. Sixth International Conference on, pages 59-66. IEEE, 
1998. 

[5] R.K. Ahuja, L. Magnanti, and J.B. Orlin. Network Flows: Theory, Algorithms, and Applica- 
tions, chapter 12, pages 469-473. Prentice Hall, 1993. 

[6] J.H. Friedman and L.C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and 
Smirnov two-sample tests. Annals of Statistics, 7:697-717, 1979. 

[7] M.F. Schilling. Multivariate two-sample tests based on nearest neighbors. Journal of the 
American Statistical Association, pages 799-806, 1986. 

[8] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings 
of the Thirtieth international conference on Very large data bases, pages 180-191. VLDB 
Endowment, 2004. 

[9] A. Gretton, K. Borgwardt, B. Schölkopf, M. Rasch, and A. Smola. A kernel method for the
two sample problem. In Advances in Neural Information Processing Systems 19, 2007.

[10] B. Efron and R. Tibshirani. An introduction to the bootstrap, chapter 14, pages 178-188.
Chapman & Hall/CRC, 1993.

[11] S. Wellek. Testing Statistical Hypotheses of Equivalence and Noninferiority; 2nd edition. 
Chapman and Hall/CRC, 2010. 

[12] J. Shao, Z. Huang, H. Shen, J. Shen, and X. Zhou. Distribution-based similarity measures for
multi-dimensional point set retrieval applications. In Proceeding of the 16th ACM international
conference on Multimedia (MM '08), 2008.

[13] J. Frank, S. Mannor, and D. Precup. Data sets: Mobile phone gait recognition data, 2010. 

[14] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M.J. Weinberger. Inequalities for the
L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.

[15] S. Boyd and L. Vandenberghe. Convex Optimization, chapter 5, pages 258-261. Cambridge
University Press, New York, NY, USA, 2004.






A Supplementary Material 

A.1 Hypothesis Testing Procedures 

The statistical tests in this section are based on the convergence bound of Theorem 3.

Notations. Throughout this section the probabilities $\mathbb{P}_0$ and $\mathbb{P}_1$ represent the
probability conditioned on the null hypothesis $\mathcal{H}_0$ and on the alternative hypothesis
$\mathcal{H}_1$, respectively.

The following procedure tests the hypothesis $\mathcal{H}_0^{(1)} : PV(P, Q, \epsilon) \le \theta$ against the
alternative $\mathcal{H}_1^{(1)} : PV(P, Q, \epsilon) > \theta$.



Procedure 1. Similarity Testing Based on PV.
Input: $\epsilon$, $\theta$ and significance level $\alpha$.

1. Sample $S_1 = \{x_1, \dots, x_n\} \sim P$ and $S_2 = \{y_1, \dots, y_m\} \sim Q$ (define $N = \min(n, m)$).

2. Normalize the data to be in $[0, 1]^d$.

3. Compute $PV(S_1, S_2, \epsilon, \|\cdot\|_\infty)$ by Algorithm 1.

4. Compute $t = \sqrt{\frac{2\log(2(2^{(1/\epsilon)^d} - 2)) + 2\log(1/\alpha)}{N}}$.

Output: Reject $\mathcal{H}_0^{(1)}$ if

$$PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) > t + \theta.$$
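The steps of Procedure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the maximum matching is found with a plain augmenting-path search (any maximum bipartite matching routine would serve), the sample PV is taken to be half the sum of the unmatched fractions of the two samples, and the threshold is assumed to have the form $t = \sqrt{(2\log(2(2^{(1/\epsilon)^d}-2)) + 2\log(1/\alpha))/N}$. The logarithm is evaluated in log-space so that $2^{(1/\epsilon)^d}$ is never formed explicitly.

```python
import math


def log_cover_term(num_cells):
    """Return log(2 * (2**num_cells - 2)) without overflow (needs num_cells > 1)."""
    # 2 * (2**a - 2) = 2**(a + 1) * (1 - 2**(1 - a))
    return (num_cells + 1) * math.log(2) + math.log1p(-2.0 ** (1 - num_cells))


def threshold(n_min, eps, dim, level):
    """Assumed threshold t of Procedure 1 for a given significance level."""
    num_cells = (1.0 / eps) ** dim
    return math.sqrt((2 * log_cover_term(num_cells) + 2 * math.log(1 / level)) / n_min)


def pv_estimate(s1, s2, eps):
    """Sample PV: half the sum of unmatched fractions under a maximum matching."""
    def close(x, y):  # sup-norm neighbourhood test
        return max(abs(a - b) for a, b in zip(x, y)) <= eps

    adj = [[j for j, y in enumerate(s2) if close(x, y)] for x in s1]
    match_of_y = [-1] * len(s2)

    def augment(i, seen):
        # Try to match x_i, re-routing earlier matches along an augmenting path.
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_of_y[j] == -1 or augment(match_of_y[j], seen):
                    match_of_y[j] = i
                    return True
        return False

    matched = sum(augment(i, set()) for i in range(len(s1)))
    return 0.5 * ((len(s1) - matched) / len(s1) + (len(s2) - matched) / len(s2))


def reject_similarity(s1, s2, eps, theta, alpha):
    """Decision rule of Procedure 1: reject when the sample PV exceeds t + theta."""
    t = threshold(min(len(s1), len(s2)), eps, len(s1[0]), alpha)
    return pv_estimate(s1, s2, eps) > t + theta
```

The recursive search is only suitable for small samples. Because the threshold grows with $(1/\epsilon)^d$, it exceeds 1 for small $N$, in which case the test can never reject.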



The probability of rejecting $\mathcal{H}_0^{(1)}$ by applying Procedure 1 when in fact it holds, also known as
the Type 1 error, is bounded in the following corollary.

Corollary 6. Assume that for given values of $\epsilon$ and $\theta$, $\mathcal{H}_0^{(1)} : PV(P, Q, \epsilon, d) \le \theta$ holds. Then for
the threshold $t$ of Procedure 1 and any $\alpha \in (0, 1)$ we have that

$$\mathbb{P}_0\left(PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) > t + \theta\right) < \alpha. \quad (5)$$

Moreover, the procedure is consistent: when $n, m \to \infty$ we have that $t \to 0$ and

$$\mathbb{P}_1\left(PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) > \theta\right) = 1.$$

The corollary is a direct result of Theorem 3.

Next, we consider the probability that Procedure 1 fails to reject when the alternative hypothesis
$\mathcal{H}_1^{(1)}$ holds, also known as the Type 2 error. Unfortunately, it is not possible to bound this probability
for a finite sample of any two distributions. To see this, consider the following example: let $P$, $Q$ be
two distributions with $PV(P, Q, \epsilon) > 0$ that differ only in an area of very low probability. Then,
for any finite sample size, there is a high probability that the samples are identical, resulting in
$PV(S_1, S_2, \epsilon) = 0$. As a result, the null hypothesis will not be rejected even though $\mathcal{H}_1^{(1)}$ holds.
However, if the PV is larger than some constant, the Type 2 error is bounded.

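Corollary 7. For $PV(P, Q, \epsilon, d) > \theta + t + b$, with $t$ of Procedure 1 and
$b = \sqrt{\frac{2\log(2(2^{(1/\epsilon)^d} - 2)) + 2\log(1/\beta)}{N}}$, we have that

$$\mathbb{P}_1\left(PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) > t + \theta\right) \ge 1 - \beta.$$

Note that as $N$ grows, the values of $b$ and $t$ get smaller, and the lower bound $PV(P, Q, \epsilon, d) >
\theta + t + b$ decreases.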

Proof. We have that

$$\mathbb{P}_1\left(PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) > t + \theta\right) = \mathbb{P}_1\left(PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) > b + t + \theta - b\right)$$
$$\ge \mathbb{P}_1\left(PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) > PV(P, Q, \epsilon, \|\cdot\|_\infty) - b\right) \ge 1 - \beta.$$

The first inequality holds by inserting the assumption on PV, and the second holds by applying the
convergence bound of Theorem 3. □
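To see how quickly the thresholds $t$ and $b$ shrink, the logarithmic term can be bounded in closed form. The derivation below is a worked sketch that assumes the thresholds have the form implied by Theorem 3, with $a = (1/\epsilon)^d > 1$ cells in the cover; the target value $\tau$ is a symbol introduced here only for illustration.

```latex
\begin{align*}
\log\bigl(2(2^{a}-2)\bigr)
  &= (a+1)\log 2 + \log\bigl(1-2^{1-a}\bigr) \le (a+1)\log 2, \\
t &\le \sqrt{\frac{2(a+1)\log 2 + 2\log(1/\alpha)}{N}}, \\
t &\le \tau \quad\text{whenever}\quad
  N \ge \frac{2(a+1)\log 2 + 2\log(1/\alpha)}{\tau^{2}}.
\end{align*}
```

The same bound holds for $b$ with $\beta$ in place of $\alpha$. The dominant term is $a = (1/\epsilon)^d$, so under this bound the sample size needed for a fixed threshold grows exponentially with the dimension.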



To give an estimate of the sample size needed for the procedure, first define the effect size $\theta_0$: the
minimal value of PV that is significant. Given $\theta_0$, set the sample size so that

$$N \ge \frac{4\log(2(2^{(1/\epsilon)^d} - 2)) + 2\log(1/\alpha) + 2\log(1/\beta)}{\theta_0^2}.$$

Using this size ensures a false positive rate bounded by $\alpha$ (Corollary 6), and a false negative rate
bounded by $\beta$ (Corollary 7).

The second test we consider is an equivalence type test. Equivalence is achieved when
$PV(P, Q, \epsilon) \le \theta$, for some chosen $\theta$, and may be tested by switching the roles of the null and the
alternative of Procedure 1. Namely, to claim similarity we need to reject $\mathcal{H}_0^{(2)} : PV(P, Q, \epsilon) > \theta$. To
test this hypothesis, a procedure similar to Procedure 1 may be applied, the principal difference being
the rejection area, which is changed to $PV(S_1, S_2, \epsilon, \|\cdot\|_\infty) < \theta - t$.



A.2 1D Projections

We present a method to gain insight into the value of the PV by multiple random projections to one
dimension. The PV between two distributions is not retained after projection to a single dimension,
as the projection contracts the distance between the points. However, we show that multiple projections
can still help to distinguish between two situations: $PV(P, Q, \epsilon) = 0$ and $PV(P, Q, \epsilon) > 0$ (see footnote 1).
First, we define a score that is based on the value of the PV after projections.

Definition 8. Let $f_i : \mathbb{R}^d \to \mathbb{R}$, for $i = 1, \dots, K$, define random projection mappings. Let $X$ and $Y$
be random variables with distributions $P$ and $Q$. The maximum projected score of two distributions
$P$ and $Q$ is

$$PPV_K(P, Q, \epsilon) = \max_{i=1,\dots,K} PV(f_i(X), f_i(Y), \epsilon).$$

For two samples $S_1 \sim P$ and $S_2 \sim Q$ the score is

$$PPV_K(S_1, S_2, \epsilon) = \max_{i=1,\dots,K} PV(f_i(S_1), f_i(S_2), \epsilon).$$



The next theorem presents the convergence of $PPV_K$ to zero for distributions with $PV(P, Q) = 0$.

Theorem 9. Let $P$ and $Q$ be two distributions on the space $([0, 1]^d, d)$, and $S_1 = \{x_1, \dots, x_n\} \sim P$
and $S_2 = \{y_1, \dots, y_m\} \sim Q$ two i.i.d. samples ($N = \min(n, m)$). Perform $K$ i.i.d. random
projections of samples $S_1$ and $S_2$ to one dimension. If $PV(P, Q, \epsilon, d) = 0$, then for any $\delta \in (0, 1)$,
with probability at least $1 - \delta$

$$PPV_K(S_1, S_2, \epsilon) \le \sqrt{\frac{2\log\left(2K(2^{1/\epsilon} - 2)/\delta\right)}{N}}.$$

Proof. Given $PV(P, Q, \epsilon, d) = 0$, we have that for all $K$ projections $PPV_K(P, Q, \epsilon, |\cdot|) =
PV(P, Q, \epsilon, d) = 0$, as the projection to 1D is a non-expansion.

In the following we denote by $\mathbb{P}_0(A)$ the probability of event $A$ under the assumption
$PV(P, Q, \epsilon, d) = 0$. Denote by $PPV_i(\epsilon) = PV(f_i(S_1), f_i(S_2), \epsilon)$ the value of PV obtained from
the $i$th projection.

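We bound the probability of the event $PPV_K(S_1, S_2, \epsilon) > \eta$:

$$\mathbb{P}_0\left(\max_{i=1,\dots,K} PPV_i(\epsilon) > \eta\right) = \mathbb{P}_0\left(\exists\, 1 \le i \le K : PPV_i(\epsilon) > \eta\right) \le \sum_{i=1}^{K} \mathbb{P}_0\left(PPV_i(\epsilon) > \eta\right), \quad (6)$$

where the last inequality is obtained by applying the union bound.
Combining (6) with Theorem 3 we have that for any $\eta \in (0, 1)$

$$\mathbb{P}_0\left(PPV_K(S_1, S_2, \epsilon) > \eta\right) \le 2K(2^{1/\epsilon} - 2)e^{-N\eta^2/2}.$$

Setting $\delta = 2K(2^{1/\epsilon} - 2)e^{-N\eta^2/2}$ concludes the proof. □

Footnote 1. Recall that PV = 0 not only when the distributions are equal, but also when they are $\epsilon$-similar.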
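The choice of $\delta$ in the proof of Theorem 9 can be inverted explicitly. The short derivation below is a worked sketch under the assumption that the tail bound has the form $2K(2^{1/\epsilon}-2)e^{-N\eta^2/2}$; it shows how the deviation $\eta$ splits into a term for the number of projections, a term for the 1D cover, and a term for the confidence level.

```latex
\begin{align*}
\delta = 2K\bigl(2^{1/\epsilon}-2\bigr)e^{-N\eta^{2}/2}
  &\iff \frac{N\eta^{2}}{2} = \log\frac{2K\bigl(2^{1/\epsilon}-2\bigr)}{\delta} \\
  &\iff \eta = \sqrt{\frac{2\log K + 2\log\bigl(2(2^{1/\epsilon}-2)\bigr) + 2\log(1/\delta)}{N}}.
\end{align*}
```

With $\delta = \alpha$ this has the same three-term structure as the threshold used in Procedure 2, and the price of using $K$ projections is only the additive $2\log K$ term.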

For $PV(P, Q) > 0$, we provide a similar lower bound on the maximum score. We will need a
further assumption for this bound.

Assumption 1. Given distributions $P$ and $Q$ with $PV(P, Q, \epsilon) > 0$, they are 1D distinguishable if
$\lim_{K \to \infty} PPV_K(P, Q, \epsilon) > 0$ almost surely.

This assumption ensures that the difference in the PV value exists in at least one projection.

Theorem 10. Let $P$ and $Q$ be two distributions on the space $([0, 1]^d, d)$. Given, for $i = 1, \dots, K$, i.i.d.
samples $S_{i1} = \{x_{i1}, \dots, x_{in}\} \sim P$ and $S_{i2} = \{y_{i1}, \dots, y_{im}\} \sim Q$, and $K$ mappings $f_i$, for any two
distributions that fulfill Assumption 1 there exists some $q \in (0, 1)$ such that for any $\delta \in (0, 1)$, with
probability at least $1 - (q - q\delta + \delta)^K$,

$$PPV(\{S_{i1}\}, \{S_{i2}\}, \epsilon) \ge \sqrt{\frac{2\log\left(2K(2^{1/\epsilon} - 2)/\delta\right)}{N}}.$$

The notation $PPV(\{S_{i1}\}, \{S_{i2}\}, \epsilon)$ denotes the maximum taken over the projections of the $K$ sets.
Notice that $q - q\delta + \delta < 1$, and therefore the failure probability decays exponentially in the number of
projections $K$.

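Proof. Let $f : \mathbb{R}^d \to \mathbb{R}$ define a random projection mapping. Let $X$ and $Y$ be random variables
generated by $P$ and $Q$. Denote $PV_1 = PV(f(X), f(Y), \epsilon)$, and $\widehat{PV}_1(\epsilon) = PV(f(S_1), f(S_2), \epsilon)$.
Note that there are two sources of randomization, the sample's and the projection's, and therefore
$PV_1$ is also a random variable.

We have that

$$\mathbb{P}\left(PPV(\{S_{i1}\}, \{S_{i2}\}, \epsilon) < \eta\right) = \mathbb{P}\left(\max_{1 \le i \le K} PV(f_i(S_{i1}), f_i(S_{i2}), \epsilon) < \eta\right)$$
$$= \mathbb{P}\left(\forall\, 1 \le i \le K,\ PV(f_i(S_{i1}), f_i(S_{i2}), \epsilon) < \eta\right) = \left[\mathbb{P}\left(\widehat{PV}_1(\epsilon) < \eta\right)\right]^K, \quad (7)$$

where the last equality holds due to the independence of the events. Next, we bound the probability
$\mathbb{P}(\widehat{PV}_1(\epsilon) < \eta)$. We define complementary events $A : PV_1 \ge 2\eta$ and $A^c : PV_1 < 2\eta$.

$$\mathbb{P}\left(\widehat{PV}_1(\epsilon) < \eta\right) = \mathbb{P}(A)\,\mathbb{P}\left(\widehat{PV}_1(\epsilon) < \eta \mid A\right) + \mathbb{P}(A^c)\,\mathbb{P}\left(\widehat{PV}_1(\epsilon) < \eta \mid A^c\right) \quad (8)$$
$$\le \mathbb{P}(A)\,\mathbb{P}\left(\widehat{PV}_1(\epsilon) < PV_1 - \eta \mid A\right) + \mathbb{P}(A^c)\,\mathbb{P}\left(\widehat{PV}_1(\epsilon) < \eta \mid A^c\right)$$
$$\le \mathbb{P}(A)\,2(2^{1/\epsilon} - 2)e^{-N\eta^2/2} + \mathbb{P}(A^c) \le \mathbb{P}(A)\,2K(2^{1/\epsilon} - 2)e^{-N\eta^2/2} + 1 - \mathbb{P}(A).$$

The inequality before last is obtained by applying Theorem 3 for any $\eta \in (0, 1)$. For
$\delta = 2K(2^{1/\epsilon} - 2)e^{-N\eta^2/2}$ we have $\eta = \sqrt{\frac{2\log(2K(2^{1/\epsilon} - 2)/\delta)}{N}}$. Substituting this $\eta$ into (8) results in

$$\mathbb{P}\left(\widehat{PV}_1(\epsilon) < \eta\right) \le 1 - (1 - \delta)\,\mathbb{P}(A). \quad (9)$$

Let $p(\eta, \epsilon) = \mathbb{P}(PV_1(P, Q, \epsilon) \le \eta)$ be the distribution of the projected PV. Clearly, $p(\eta, \epsilon)$ depends
on the generating distributions $P$ and $Q$, and its support is $[0, \sup_f(PPV_1(P, Q, \epsilon))]$. We assume
that $\sup_f(PPV_1(P, Q, \epsilon)) > 0$, and therefore there must be some $q \in (0, 1)$ for which

$$\mathbb{P}\left(PV_1(P, Q, \epsilon) < 2\eta\right) \le q. \quad (10)$$

Combining the results of (7), (9) and (10), we have that for any $0 < \delta < 1$

$$\mathbb{P}\left(PPV(\{S_{i1}\}, \{S_{i2}\}, \epsilon) < \eta\right) \le \left(1 - (1 - \delta)\,\mathbb{P}(PV_1 \ge 2\eta)\right)^K \le (q - q\delta + \delta)^K.$$

Therefore, with probability at least $1 - (q - q\delta + \delta)^K$,

$$PPV(\{S_{i1}\}, \{S_{i2}\}, \epsilon) \ge \eta = \sqrt{\frac{2\log\left(2K(2^{1/\epsilon} - 2)/\delta\right)}{N}}.$$

Note that any $q < 1$ results in $q - q\delta + \delta < 1$, and therefore we get exponential decay. For
example, for $q = 1/2$ ($2\eta$ is smaller than the median) we have $(q - q\delta + \delta)^K = \left(\frac{1 + \delta}{2}\right)^K$. □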
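The exponential decay in $K$ can be made concrete with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the fraction $q$ is distribution dependent and unknown in practice, so the values passed in are assumptions rather than quantities the procedure can compute.

```python
import math


def type2_bound(q, delta, num_projections):
    """Evaluate the bound (q - q*delta + delta)**K from Theorem 10."""
    base = q - q * delta + delta
    return base ** num_projections


def projections_needed(q, delta, beta):
    """Smallest K for which (q - q*delta + delta)**K <= beta."""
    base = q - q * delta + delta
    if not 0 < base < 1:
        raise ValueError("need 0 < q - q*delta + delta < 1")
    # base**K <= beta  <=>  K >= log(beta) / log(base), both logs being negative
    return max(1, math.ceil(math.log(beta) / math.log(base)))
```

For instance, with the assumed values $q = 1/2$ and $\delta = 0.1$ the base is $0.55$, and six projections suffice to push the bound below $0.05$.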



Theorems 9 and 10 are complementary, and may be used together to infer whether or not
$PV(P, Q) = 0$. Next, we describe the suitable hypothesis testing procedure for this goal.
Procedure 2 provides statistical tests based on the score $PPV_K$ (Definition 8). The procedure tests a
hypothesis of the first type with $\theta = 0$: $\mathcal{H}_0 : PV(P, Q, \epsilon) = 0$ against $\mathcal{H}_1 : PV(P, Q, \epsilon) > 0$.



Procedure 2. Similarity testing based on $PPV_K$.
Input: $\epsilon$ level, number of projections $K$, and significance level $\alpha$.

For $i = 1, \dots, K$ do

1. Sample $S_{i1} = \{x_1, \dots, x_n\} \sim P$ and $S_{i2} = \{y_1, \dots, y_m\} \sim Q$ i.i.d. examples on $[0, 1]^d$.

2. Sample a unit random vector $r_i \in S^{d-1}$.

3. Project to 1D: $P_{S_{i1}} = \{r_i^T x_1, \dots, r_i^T x_n\}$ and $P_{S_{i2}} = \{r_i^T y_1, \dots, r_i^T y_m\}$.

4. Compute $PV(P_{S_{i1}}, P_{S_{i2}}, \epsilon)$.

end for

Compute $PPV_K = \max_{i=1,\dots,K} PV(P_{S_{i1}}, P_{S_{i2}}, \epsilon)$.

Compute $t = \sqrt{\frac{2\log(K) + 2\log(2(2^{1/\epsilon} - 2)) + 2\log(1/\alpha)}{N}}$, where $N = \min(n, m)$.

Output: Reject $\mathcal{H}_0$ if $PPV_K > t$.
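The loop of Procedure 2 can be sketched in Python using only the standard library. This is a hedged illustration, not the authors' code: the function names are hypothetical, the same two samples are reused for every projection (as in the sample score of Definition 8) rather than redrawn in each iteration, and the 1D matching uses a sorted greedy sweep, which yields a maximum matching for points on a line that are joined when they lie within $\epsilon$ of each other.

```python
import math
import random


def pv_1d(xs, ys, eps):
    """Sample PV on the line via a greedy sweep over the sorted points."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = matched = 0
    while i < len(xs) and j < len(ys):
        if abs(xs[i] - ys[j]) <= eps:
            matched += 1  # the two smallest remaining points can be paired
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1  # xs[i] is too far below every remaining y
        else:
            j += 1  # ys[j] is too far below every remaining x
    return 0.5 * ((len(xs) - matched) / len(xs) + (len(ys) - matched) / len(ys))


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalising a Gaussian."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def ppv(s1, s2, eps, num_projections, rng):
    """Maximum projected PV over num_projections random 1D projections."""
    dim = len(s1[0])
    best = 0.0
    for _ in range(num_projections):
        r = random_unit_vector(dim, rng)
        p1 = [sum(a * b for a, b in zip(r, x)) for x in s1]
        p2 = [sum(a * b for a, b in zip(r, y)) for y in s2]
        best = max(best, pv_1d(p1, p2, eps))
    return best


def threshold_ppv(n_min, eps, num_projections, alpha):
    """Assumed threshold t of Procedure 2, evaluated in log-space."""
    cells = 1.0 / eps
    log_cover = (cells + 1) * math.log(2) + math.log1p(-2.0 ** (1 - cells))
    total = 2 * math.log(num_projections) + 2 * log_cover + 2 * math.log(1 / alpha)
    return math.sqrt(total / n_min)
```

A test along the lines of Procedure 2 would then reject when `ppv(s1, s2, eps, K, random.Random(0))` exceeds `threshold_ppv(min(len(s1), len(s2)), eps, K, alpha)`.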







The next corollary bounds the Type 1 error of Procedure 2 and shows that the test is consistent.

Corollary 11. Assume that the null hypothesis holds: $\mathcal{H}_0 : PV(P, Q, \epsilon, d) = 0$. Then, for the
threshold $t$ of Procedure 2 and any $\alpha \in (0, 1)$ we have that

$$\mathbb{P}_0\left(PPV_K(S_1, S_2, \epsilon) > t\right) < \alpha. \quad (11)$$

Moreover, when $N \to \infty$, $K \to \infty$ and $\frac{\log K}{N} \to 0$ we have that $\mathbb{P}_1\left(PPV_K(S_1, S_2, \epsilon) > t\right) = 1$.

The bound is obtained by Theorem 9. The consistency is conditioned on Assumption 1 and obtained
by Theorem 10.

Theorem 10 bounds the Type 2 error of Procedure 2, which depends on the number of projections
$K$ and on the fraction $q$, which is distribution dependent. The bound decays exponentially as $K$
grows, and therefore, to gain statistical power, a larger number of projections can be used.

A.3 Proof of Theorem 3

We restate the theorem for clarity:

Theorem 3. Suppose we are given two i.i.d. samples $S_1 = \{x_1, \dots, x_n\} \in \mathbb{R}^d$ and $S_2 =
\{y_1, \dots, y_m\} \in \mathbb{R}^d$ generated by distributions $P$ and $Q$, respectively. Let the ground distance be
$d = \|\cdot\|_\infty$ and let $\mathcal{N}(\epsilon)$ be the cardinality of a disjoint cover of the distributions' support. Then,
for any $\delta \in (0, 1)$, $N = \min(n, m)$, and $\eta = \sqrt{\frac{2(\log(2(2^{\mathcal{N}(\epsilon)} - 2)) + \log(1/\delta))}{N}}$ we have that

$$\mathbb{P}\left(\left|PV(S_1, S_2, \epsilon) - PV(P, Q, \epsilon)\right| \le \eta\right) \ge 1 - \delta.$$

Before proving the theorem we present the required definitions and lemmas. The proofs of the
lemmas are presented immediately after the proof of the theorem. We assume the domain is totally
bounded, and, for simplicity of presentation, we assume the metric space is $([0, 1]^d, d_\infty = \|\cdot\|_\infty)$.

We define a discretization on the support of the distributions.

Definition 12 (Discretization). The $\epsilon$-discretization over the space $([0, 1]^d, d_\infty = \|\cdot\|_\infty)$ is a
partition on the set $C(\epsilon) = \{a_1, \dots, a_N\}$, with cardinality $N = (1/\epsilon)^d$. Each element in $C(\epsilon)$ is
the center of a box of volume $\epsilon^d$. The boxes do not intersect, and their union covers $[0, 1]^d$. Each
$a_i \in C(\epsilon)$ has a density equal to the distribution's mass in its neighborhood: $B(a_i, d_\infty, \epsilon/2) = \{z :
d_\infty(a_i, z) \le \epsilon/2\}$.

We refer to the resulting discretized versions of the distributions $P$ and $Q$ as $\mu_1(\epsilon)$, $\mu_2(\epsilon)$
respectively. Also, let $\hat\mu_1(\epsilon)$, $\hat\mu_2(\epsilon)$ be the histograms of the samples $S_1$ and $S_2$, defined on the
$\epsilon$-discretization $C(\epsilon)$.
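The histograms $\hat\mu_1(\epsilon)$, $\hat\mu_2(\epsilon)$ of Definition 12 can be computed with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, boxes are identified by integer index tuples rather than by their centers, and $1/\epsilon$ is assumed to be an integer so that the boxes tile $[0, 1]^d$ exactly.

```python
from collections import Counter


def histogram(sample, eps):
    """Empirical mass of each eps-box of [0, 1]^d, keyed by integer box index."""
    cells_per_axis = round(1.0 / eps)  # assumption: 1/eps is an integer
    counts = Counter(
        # Points on the upper boundary (coordinate 1.0) go to the last box.
        tuple(min(int(c * cells_per_axis), cells_per_axis - 1) for c in x)
        for x in sample
    )
    return {cell: k / len(sample) for cell, k in counts.items()}


def box_center(cell, eps):
    """Center a_i of the box with the given integer index."""
    return tuple((idx + 0.5) * eps for idx in cell)


def l1_distance(h1, h2):
    """L1 distance between two histograms defined on the same discretization."""
    return sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in set(h1) | set(h2))
```

The `l1_distance` helper computes the kind of quantity, $\|\mu - \hat\mu\|_1$, that the lemmas of this section bound.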

The proof of Theorem 3 is based on formulating the relations between $PV(S_1, S_2)$ and $PV(\hat\mu_1, \hat\mu_2)$,
and between $PV(P, Q)$ and $PV(\mu_1, \mu_2)$; then, turning to the discrete versions, bounding the
difference between $PV(\hat\mu_1, \hat\mu_2)$ and $PV(\mu_1, \mu_2)$.

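The relation between the different versions of the PV, continuous, discrete and sampled, is provided
in the next lemma.

Lemma 13. Let $S_1 = \{x_1, \dots, x_n\} \sim P$ and $S_2 = \{y_1, \dots, y_m\} \sim Q$ be two samples. Let $\mu_1(\nu)$
and $\mu_2(\nu)$ be the $\nu$-discretizations of $P$ and $Q$ for any integer $T > 1$ and $\nu = \frac{\epsilon}{T}$. Let $\hat\mu_1(\nu)$ and
$\hat\mu_2(\nu)$ be their empirical distributions. The following relations hold for any $\epsilon$, $\epsilon' = \frac{\epsilon(2T - 1)}{2T}$,
$\epsilon'' = \frac{\epsilon(2T + 1)}{2T}$ and $d = \|\cdot\|_\infty$:

$$PV(\hat\mu_1, \hat\mu_2, \epsilon'') \le PV(S_1, S_2, \epsilon) \le PV(\hat\mu_1, \hat\mu_2, \epsilon') \quad (12)$$

$$PV(\mu_1, \mu_2, \epsilon'') \le PV(P, Q, \epsilon) \le PV(\mu_1, \mu_2, \epsilon'). \quad (13)$$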

We use the following structure of two discretizations.

Definition 14 (Refinement of a discretization). Define an initial $\epsilon$-discretization $C_1(\epsilon) =
\{b_1, \dots, b_{N(\epsilon)}\}$ on $([0, 1]^d, \|\cdot\|_\infty)$. The refinement of the discretization, for any $\epsilon$ and $T > 1$, is
defined as a $\nu$-discretization on $C_2(\nu) = \{a_1, \dots, a_{N(\nu)}\}$, where $\nu = \epsilon/T$, such that each element
of the refinement is a result of splitting an element of the initial cover into $T^d$ elements of equal
volume $(\epsilon/T)^d$.

The next lemma bounds the difference between the PV on the discrete empirical distributions $\hat\mu_1(\nu)$, $\hat\mu_2(\nu)$
and on the distributions $\mu_1(\nu)$ and $\mu_2(\nu)$.

Lemma 15. Let $C_1(\epsilon)$ be an $\epsilon$-discretization on $[0, 1]^d$, and $C_2(\nu)$ its refined discretization
(Definition 14). Let $\hat\mu_i(\epsilon)$ and $\mu_i(\epsilon)$, $i = 1, 2$, be distributions on $C_1(\epsilon)$, and $\hat\mu_i(\nu)$ and $\mu_i(\nu)$
distributions on the refinement $C_2(\nu)$. For any $\epsilon \in (0, 1)$ and $d = \|\cdot\|_\infty$ we have that

$$\left|PV(\hat\mu_1(\nu), \hat\mu_2(\nu), \epsilon) - PV(\mu_1(\nu), \mu_2(\nu), \epsilon)\right| \le \frac{1}{2}\left(\|\mu_1(\epsilon) - \hat\mu_1(\epsilon)\|_1 + \|\mu_2(\epsilon) - \hat\mu_2(\epsilon)\|_1\right).$$

Observe that the $L_1$-norm is computed over the elements of $C_1(\epsilon)$.
We use the following result provided by [14] (Theorem 2.1).

Lemma 16. Let $\mu$ be a probability distribution on the set $A = \{1, \dots, a\}$. Let $X = x_1, x_2, \dots, x_N$
be i.i.d. random variables distributed according to $\mu$, and $\hat\mu$ the resulting empirical distribution.
Then, for $\eta > 0$,

$$\mathbb{P}\left(\|\mu - \hat\mu\|_1 \ge \eta\right) \le (2^a - 2)e^{-N\eta^2/2}.$$
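The tail bound of Lemma 16 is easiest to handle numerically in log-space, since $2^a$ overflows for even moderately fine discretizations. The Python sketch below assumes the bound has the form $(2^a - 2)e^{-N\eta^2/2}$, as used in the proof of Theorem 3; the function names are hypothetical.

```python
import math


def log_tail_bound(alphabet_size, n, eta):
    """log of (2**a - 2) * exp(-n * eta**2 / 2), for alphabet size a > 1."""
    a = alphabet_size
    # 2**a - 2 = 2**a * (1 - 2**(1 - a))
    log_count = a * math.log(2) + math.log1p(-2.0 ** (1 - a))
    return log_count - n * eta ** 2 / 2


def deviation_for_confidence(alphabet_size, n, delta):
    """Value of eta at which the assumed bound equals delta."""
    a = alphabet_size
    log_count = a * math.log(2) + math.log1p(-2.0 ** (1 - a))
    return math.sqrt(2 * (log_count + math.log(1 / delta)) / n)
```

Inverting the bound for $\eta$, as `deviation_for_confidence` does, is the same algebraic step that produces the deviation $\eta$ in the statement of Theorem 3, up to the union bound over the two samples.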



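Proof of Theorem 3. Set $\epsilon' = \frac{\epsilon(2T - 1)}{2T}$ and $\epsilon'' = \frac{\epsilon(2T + 1)}{2T}$, and define

$$m(T) = PV(\mu_1(\nu), \mu_2(\nu), \epsilon') - PV(\mu_1(\nu), \mu_2(\nu), \epsilon'').$$

By Lemma 13 the value of $m(T)$ is non-negative. Combining Lemma 13 with Lemma 15 yields

$$PV(S_1, S_2, \epsilon) \le PV(\hat\mu_1(\nu), \hat\mu_2(\nu), \epsilon')$$
$$\le PV(\mu_1(\nu), \mu_2(\nu), \epsilon') + \tfrac{1}{2}\|\mu_1(\epsilon') - \hat\mu_1(\epsilon')\|_1 + \tfrac{1}{2}\|\mu_2(\epsilon') - \hat\mu_2(\epsilon')\|_1$$
$$= PV(\mu_1(\nu), \mu_2(\nu), \epsilon'') + m(T) + \tfrac{1}{2}\|\mu_1(\epsilon') - \hat\mu_1(\epsilon')\|_1 + \tfrac{1}{2}\|\mu_2(\epsilon') - \hat\mu_2(\epsilon')\|_1$$
$$\le PV(P, Q, \epsilon) + m(T) + \tfrac{1}{2}\|\mu_1(\epsilon') - \hat\mu_1(\epsilon')\|_1 + \tfrac{1}{2}\|\mu_2(\epsilon') - \hat\mu_2(\epsilon')\|_1. \quad (14)$$

Recall that the number of elements for an $\epsilon$-discretization on $C_1(\epsilon)$ is $N(\epsilon) = (1/\epsilon)^d$. Apply Lemma
16 to $\|\mu_1(\epsilon') - \hat\mu_1(\epsilon')\|_1 \le \eta$ and $\|\mu_2(\epsilon') - \hat\mu_2(\epsilon')\|_1 \le \eta$, and combine the result with (14) using
the union bound. We have that with probability at least $1 - 2(2^{(1/\epsilon')^d} - 2)e^{-N\eta^2/2}$

$$PV(S_1, S_2, \epsilon) - PV(P, Q, \epsilon) \le m(T) + \eta. \quad (15)$$

In a similar manner we have

$$PV(S_1, S_2, \epsilon) \ge PV(\hat\mu_1(\nu), \hat\mu_2(\nu), \epsilon'')$$
$$\ge PV(\mu_1(\nu), \mu_2(\nu), \epsilon'') - \tfrac{1}{2}\|\mu_1(\epsilon'') - \hat\mu_1(\epsilon'')\|_1 - \tfrac{1}{2}\|\mu_2(\epsilon'') - \hat\mu_2(\epsilon'')\|_1$$
$$= PV(\mu_1(\nu), \mu_2(\nu), \epsilon') - m(T) - \tfrac{1}{2}\|\mu_1(\epsilon'') - \hat\mu_1(\epsilon'')\|_1 - \tfrac{1}{2}\|\mu_2(\epsilon'') - \hat\mu_2(\epsilon'')\|_1$$
$$\ge PV(P, Q, \epsilon) - m(T) - \tfrac{1}{2}\|\mu_1(\epsilon'') - \hat\mu_1(\epsilon'')\|_1 - \tfrac{1}{2}\|\mu_2(\epsilon'') - \hat\mu_2(\epsilon'')\|_1. \quad (16)$$

Combining the result with the tail bounds of $\hat\mu_1$, $\hat\mu_2$ from Lemma 16 and applying the union bound,
we have that with probability at least $1 - 2(2^{(1/\epsilon'')^d} - 2)e^{-N\eta^2/2}$

$$PV(P, Q, \epsilon) - PV(S_1, S_2, \epsilon) \le m(T) + \eta. \quad (17)$$

For large $T$ we have that $\epsilon' \approx \epsilon'' \approx \epsilon$, and therefore $m(T) \to 0$ as $T \to \infty$.
Taking $T \to \infty$ in (15) and (17), and combining the result, we get that for any $\delta \in (0, 1)$ and
$\eta = \sqrt{\frac{2(\log(2(2^{(1/\epsilon)^d} - 2)) + \log(1/\delta))}{N}}$

$$\mathbb{P}\left(\left|PV(S_1, S_2, \epsilon) - PV(P, Q, \epsilon)\right| > \eta\right) < \delta.$$

□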



Proofs of Lemmas 13 and 15

Proof of Lemma 13.

Let sample $x_i \in S_1$ belong to the element $a_k$ in the $\nu$-discretization, that is $x_i \in B(a_k, \|\cdot\|_\infty, \frac{\nu}{2})$
with $\frac{\nu}{2} = \frac{\epsilon}{2T}$. Recall that the $\epsilon$-neighborhood of a sample $x_i$ is the set $ng(x_i, \epsilon) = \{z : d(x_i, z) \le \epsilon\}$, and the
$\epsilon''$-neighborhood of bin $a_k$ is the set $ng(a_k, \epsilon'') = \{z : d(a_k, z) \le \epsilon''\}$. For the left
side of (12), observe that for any such $x_i$ there exist values of $z$ such that $\|z - a_k\|_\infty \le \epsilon''$
but $\|z - x_i\|_\infty > \epsilon$, while for any $z$ for which $\|z - x_i\|_\infty \le \epsilon$ also $\|z - a_k\|_\infty \le \epsilon''$. As a
result, $ng(x_i, \epsilon) \subset ng(a_k, \epsilon'')$. Enlarging the number of neighbors adds edges to the bipartite
graph describing the problem, and accordingly, a matching with a larger cardinality may be found.
In such a case, the number of unmatched samples is decreased, and therefore the PV is decreased,
as it is the normalized sum of the unmatched samples.

For the right hand side of (12), observe that when the neighborhood size is $\epsilon'$, for any point
$x_i \in B(a_k, \|\cdot\|_\infty, \frac{\nu}{2})$ we have that $ng(x_i, \epsilon) \supset ng(a_k, \epsilon')$, as the $\epsilon$-neighborhood of each point mass
encloses the $\epsilon'$-neighborhood of its ascribed bin. As a result, the PV between the histograms
$\hat\mu_1$ and $\hat\mu_2$ may correspond to a graph that has fewer edges, which may result in a maximum matching
with a smaller cardinality. As a result, the discrete version may have a larger PV.
Inequalities (13) hold, as the same claims apply for the discretization of the distributions. □

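The following representation of Problem (2) will be useful for the proof of Lemma 15.
Lemma 17. The solution of Problem (2) may be obtained by solving the following problem

$$\min_{w_i, v_j, Z_{ij}} \ \frac{1}{2}\sum_{i=1}^{N} |w_i| + \frac{1}{2}\sum_{j=1}^{N} |v_j| \quad (18)$$
$$\text{s.t.} \quad \sum_{a_j \in ng(a_i, \epsilon)} Z_{ij} + w_i = \mu_1(a_i), \quad i = 1, \dots, N$$
$$\sum_{a_i \in ng(a_j, \epsilon)} Z_{ij} + v_j = \mu_2(a_j), \quad j = 1, \dots, N$$
$$Z_{ij} \ge 0, \quad \forall i, j,$$

which we call $PV_{eq}(\mu_1(\nu), \mu_2(\nu), \epsilon)$.

The lemma states that the constraints $w_i \ge 0$, $v_j \ge 0$ may be removed, and instead the sum in the
objective is taken over the absolute values.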



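Proof of Lemma 17.

First note that any solution of Problem (2) is a feasible solution of Problem (18), and so we have
that the optima satisfy $PV(\mu_1(\nu), \mu_2(\nu), \epsilon) \ge PV_{eq}(\mu_1(\nu), \mu_2(\nu), \epsilon)$. We construct a solution of (2)
that realizes the equality, and therefore is optimal. Namely, to show the problems are equivalent
it is sufficient to show that any solution of (18) has a corresponding solution of (2) with the same
objective value.

Let $w_i, v_j, Z_{ij}$ be the solution to (18). In the following, we construct a feasible solution
$\tilde w_i, \tilde v_j, \tilde Z_{ij}$ to (2):

If $w_i < 0$ and $v_i \ge 0$, set $\Delta_i = |w_i|$ and

$$\tilde w_i = w_i + \Delta_i = 0, \quad \tilde v_i = v_i + \Delta_i \ge 0, \quad \sum_{a_j \in ng(a_i)} \tilde Z_{ij} = \sum_{a_j \in ng(a_i)} Z_{ij} - \Delta_i.$$

If $v_i < 0$ and $w_i \ge 0$, set $\Gamma_i = |v_i|$ and

$$\tilde v_i = v_i + \Gamma_i = 0, \quad \tilde w_i = w_i + \Gamma_i \ge 0, \quad \sum_{a_j \in ng(a_i)} \tilde Z_{ji} = \sum_{a_j \in ng(a_i)} Z_{ji} - \Gamma_i.$$

If both $w_i < 0$ and $v_i < 0$, set

$$\tilde w_i = w_i + \Delta_i + \Gamma_i \ge 0, \quad \tilde v_i = v_i + \Delta_i + \Gamma_i \ge 0,$$

and reduce the sums of $\tilde Z_{ij}$ and $\tilde Z_{ji}$ over $a_j \in ng(a_i)$ accordingly.

Otherwise, set $\tilde w_i = w_i$, $\tilde v_j = v_j$, and $\tilde Z_{ij} = Z_{ij}$.

The resulting $\tilde w_i, \tilde v_j, \tilde Z_{ij}$ obey the equality constraints in (2) while fixing $\tilde w_i \ge 0$, $\tilde v_j \ge 0$. It is easy
to show that there exists $\tilde Z_{ij} \ge 0$ that obeys the equalities above. The objective value of (18) with
$w_i, v_j, Z_{ij}$ and of (2) with $\tilde w_i, \tilde v_j, \tilde Z_{ij}$ is equal:

$$\sum_{i=1}^{N} \tilde w_i + \sum_{j=1}^{N} \tilde v_j = \sum_{i=1}^{N} (w_i + v_i)\,1_{[w_i \ge 0,\, v_i \ge 0]} + \sum_{i=1}^{N} \left((w_i + \Delta_i) + (v_i + \Delta_i)\right) 1_{[w_i < 0,\, v_i \ge 0]}$$
$$+ \sum_{i=1}^{N} \left((w_i + \Gamma_i) + (v_i + \Gamma_i)\right) 1_{[w_i \ge 0,\, v_i < 0]} + \sum_{i=1}^{N} \left((w_i + \Delta_i + \Gamma_i) + (v_i + \Delta_i + \Gamma_i)\right) 1_{[w_i < 0,\, v_i < 0]}$$
$$= \sum_{i=1}^{N} |w_i| + \sum_{j=1}^{N} |v_j|.$$

We conclude that $\tilde w_i, \tilde v_j, \tilde Z_{ij}$ attains the optimal solution to Problem (2). □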



Proof of Lemma 15.

Let $Z^*_{ij}, w^*_i, v^*_j$ be the optimal arguments for which $PV(\mu_1, \mu_2, \epsilon)$ is obtained (Problem (2)). There
are two stages to bounding the difference between $PV(\hat\mu_1, \hat\mu_2, \epsilon)$ and $PV(\mu_1, \mu_2, \epsilon)$. First, by Lemma
17 we know that given a solution $PV_{eq}(\hat\mu_1, \hat\mu_2, \epsilon)$ we can find an equivalent solution $PV(\hat\mu_1, \hat\mu_2, \epsilon)$.
As a result, we may bound the difference between $PV(\mu_1, \mu_2, \epsilon)$ and $PV_{eq}(\hat\mu_1, \hat\mu_2, \epsilon)$ instead of
the difference between $PV(\mu_1, \mu_2, \epsilon)$ and $PV(\hat\mu_1, \hat\mu_2, \epsilon)$. To bound this difference, we change the
solution $Z^*_{ij}, w^*_i, v^*_j$ to describe a feasible solution to Problem (18) for distributions $\hat\mu_1$ and $\hat\mu_2$.

To obtain a feasible solution to Problem (18), we must fix the violations that are made to its
constraints by substituting $Z^*_{ij}, w^*_i, v^*_j$ into Problem (18). The constraints are fixed in two manners.
Some constraints are fixed by optimizing the transportation plan, described by matrix $Z$, within the
refinement of the discretization. Additional violations are fixed by changing the variables $w_i$ and
$v_j$.
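Define $s_k = \{a_i : a_i \in B(b_k, \|\cdot\|_\infty, \epsilon/2)\}$; i.e., the set of bins $a_i \in C_2(\nu)$ that are a refinement of
element $b_k \in C_1(\epsilon)$ (Definition 14). Let $|s_k|$ be the cardinality of this set. By definition, all the bins
in $s_k$ are $\epsilon$-neighbors: $\forall a_i \in s_k$, $s_k \subset ng(a_i, \epsilon)$.

For any $a_i, a_j \in s_k$, consider the following feasibility problem:

$$\text{Find } C_{ij} \quad (19)$$
$$\text{s.t.} \quad \sum_{a_j \in s_k} C_{ij} = c_i, \quad \forall a_i \in s_k,$$
$$\sum_{a_i \in s_k} C_{ij} = b_j, \quad \forall a_j \in s_k,$$
$$Z^*_{ij} + C_{ij} \ge 0, \quad \forall a_i, a_j \in s_k,$$

where

$$c_i = \left(\hat\mu_1(a_i) - \mu_1(a_i)\right) - \frac{1}{|s_k|}\left(\hat\mu_1(b_k) - \mu_1(b_k)\right),$$
$$b_j = \left(\hat\mu_2(a_j) - \mu_2(a_j)\right) - \frac{1}{|s_k|}\left(\hat\mu_2(b_k) - \mu_2(b_k)\right).$$

Note that $c_i$ and $b_j$ may be positive or negative, and that $\sum_{a_i \in s_k} c_i = 0$ and $\sum_{a_j \in s_k} b_j = 0$.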

We show that the following values $w_i, v_j, Z_{ij}$ for $i, j = 1, \dots, N(\nu)$ are a feasible solution to
Problem (18):

$$w_i = w^*_i + \frac{1}{|s_k|}\left(\hat\mu_1(b_k) - \mu_1(b_k)\right) \quad (20)$$
$$v_j = v^*_j + \frac{1}{|s_k|}\left(\hat\mu_2(b_k) - \mu_2(b_k)\right)$$
$$Z_{ij} = \begin{cases} Z^*_{ij} & \text{if } a_j \notin s_k,\ a_i \in s_k, \\ Z^*_{ij} + C_{ij} & \text{if } a_j \in s_k,\ a_i \in s_k, \end{cases}$$

where $C_{ij}$ is the solution to (19).



First, we show that Problem (19) is feasible. To do so, we consider its dual representation. Define
$v = \mathrm{Vec}(\{C_{ij}\}_{a_i, a_j \in s_k}) \in \mathbb{R}^{|s_k|^2 \times 1}$, the vector form of the sub-matrix $\{C_{ij}\}_{a_i, a_j \in s_k}$. Similarly, let
$z^* = \mathrm{Vec}(\{Z^*_{ij}\}_{a_i, a_j \in s_k}) \in \mathbb{R}^{|s_k|^2 \times 1}$. Let $A \in \mathbb{R}^{2|s_k| \times |s_k|^2}$ be the zero-one matrix defined by the
left-hand sides of the equality constraints in (19), and $d = [c_1, \dots, c_{|s_k|}, b_1, \dots, b_{|s_k|}]^T \in \mathbb{R}^{2|s_k| \times 1}$
the vector defined by the right-hand sides of these constraints. Using these notations, Problem (19)
is equivalent to

$$\text{Find } v \quad \text{s.t.} \quad Av = d, \quad -v - z^* \le 0,$$

whose dual representation is the existence of $\lambda \in \mathbb{R}^{|s_k|^2 \times 1}$, $\eta \in \mathbb{R}^{2|s_k| \times 1}$ for which

$$g(\lambda, \eta) = \inf_v\ \lambda^T(-v - z^*) + \eta^T(Av - d) > 0, \quad \lambda \ge 0. \quad (21)$$

The value of $g(\lambda, \eta)$ in (21) is not $-\infty$ only when $A^T\eta - \lambda = 0$, for which

$$g(\lambda, \eta) = \inf_v\ \lambda^T(-v - z^*) + \eta^T(Av - d) = \inf_v\ v^T(-\lambda + A^T\eta) - \lambda^T z^* - \eta^T d = -\lambda^T z^* - \eta^T d.$$

Since $z^* \ge 0$ and $\lambda \ge 0$, we have that $-\lambda^T z^* \le 0$. By noting that $1^T d = \sum_i c_i + \sum_j b_j = 0$,
we have that $-\eta^T d \le -\min_i \eta_i \cdot 1^T d = 0$. We conclude that $g(\lambda, \eta) \le 0$, and therefore Problem
(21) is infeasible. By the theorem of alternatives [15], Problem (19) is feasible.



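Next, we show that the proposed solution $Z_{ij}, w_i, v_j$ is indeed a feasible solution of Problem (18).
The constraints $Z_{ij} \ge 0$ hold by the feasibility of (19). The equality constraints also hold:

$$\sum_{a_j \in ng(a_i, \epsilon)} Z_{ij} + w_i = \sum_{a_j \in ng(a_i, \epsilon)} Z^*_{ij} + \sum_{a_j \in s_k} C_{ij} + w^*_i + \frac{1}{|s_k|}\left(\hat\mu_1(b_k) - \mu_1(b_k)\right)$$
$$= \sum_{a_j \in ng(a_i, \epsilon)} Z^*_{ij} + w^*_i + \hat\mu_1(a_i) - \mu_1(a_i) - \frac{1}{|s_k|}\left(\hat\mu_1(b_k) - \mu_1(b_k)\right) + \frac{1}{|s_k|}\left(\hat\mu_1(b_k) - \mu_1(b_k)\right)$$
$$= \mu_1(a_i) + \left(\hat\mu_1(a_i) - \mu_1(a_i)\right) = \hat\mu_1(a_i),$$

and

$$\sum_{a_i \in ng(a_j, \epsilon)} Z_{ij} + v_j = \sum_{a_i \in ng(a_j, \epsilon)} Z^*_{ij} + \sum_{a_i \in s_k} C_{ij} + v^*_j + \frac{1}{|s_k|}\left(\hat\mu_2(b_k) - \mu_2(b_k)\right)$$
$$= \sum_{a_i \in ng(a_j, \epsilon)} Z^*_{ij} + v^*_j + \hat\mu_2(a_j) - \mu_2(a_j) - \frac{1}{|s_k|}\left(\hat\mu_2(b_k) - \mu_2(b_k)\right) + \frac{1}{|s_k|}\left(\hat\mu_2(b_k) - \mu_2(b_k)\right)$$
$$= \mu_2(a_j) + \left(\hat\mu_2(a_j) - \mu_2(a_j)\right) = \hat\mu_2(a_j).$$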

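To conclude the proof, we bound the difference between the objective of Problem (2), obtained with the
values $Z^*_{ij}, w^*_i, v^*_j$, and the objective of Problem (18), obtained with the values $Z_{ij}, w_i, v_j$.

Since the discretization defined on $C_2(\nu)$ is a refinement of $C_1(\epsilon)$ (Definition 14), we have that

$$\sum_{i=1}^{N(\nu)} \left(|w_i| + |v_i|\right) = \sum_{k=1}^{N(\epsilon)} \sum_{a_i \in s_k} \left(|w_i| + |v_i|\right). \quad (22)$$

Substituting the values of $|w_i|$, $|v_i|$ by their assignment in (20) we obtain

$$\frac{1}{2}\sum_{i=1}^{N(\nu)} \left(|w_i| + |v_i|\right) = \frac{1}{2}\sum_{k=1}^{N(\epsilon)} \sum_{a_i \in s_k} \left( \left|w^*_i + \frac{1}{|s_k|}\left(\hat\mu_1(b_k) - \mu_1(b_k)\right)\right| + \left|v^*_i + \frac{1}{|s_k|}\left(\hat\mu_2(b_k) - \mu_2(b_k)\right)\right| \right). \quad (23)$$

Applying the triangle inequality on each element in the sum:

$$\left|w^*_i + \frac{1}{|s_k|}\left(\hat\mu_1(b_k) - \mu_1(b_k)\right)\right| \le |w^*_i| + \frac{1}{|s_k|}\left|\hat\mu_1(b_k) - \mu_1(b_k)\right|$$
$$\left|v^*_i + \frac{1}{|s_k|}\left(\hat\mu_2(b_k) - \mu_2(b_k)\right)\right| \le |v^*_i| + \frac{1}{|s_k|}\left|\hat\mu_2(b_k) - \mu_2(b_k)\right|,$$

as well as noting that $w^*_i, v^*_i \ge 0$ by definition, we have that

$$\frac{1}{2}\sum_{i=1}^{N(\nu)} \left(|w_i| + |v_i|\right) \le \frac{1}{2}\sum_{i=1}^{N(\nu)} \left(w^*_i + v^*_i\right) + \frac{1}{2}\|\mu_1(\epsilon) - \hat\mu_1(\epsilon)\|_1 + \frac{1}{2}\|\mu_2(\epsilon) - \hat\mu_2(\epsilon)\|_1. \quad (24)$$

By Lemma 17 we have that the solution of Problem (2) may be obtained by solving Problem (18).
Therefore, combining (24) with Lemma 17 we have that

$$PV(\hat\mu_1(\nu), \hat\mu_2(\nu), \epsilon) - PV(\mu_1(\nu), \mu_2(\nu), \epsilon) \le \frac{1}{2}\sum_{i=1}^{N(\nu)} \left(|w_i| + |v_i|\right) - \frac{1}{2}\sum_{i=1}^{N(\nu)} \left(w^*_i + v^*_i\right)$$
$$\le \frac{1}{2}\|\mu_1(\epsilon) - \hat\mu_1(\epsilon)\|_1 + \frac{1}{2}\|\mu_2(\epsilon) - \hat\mu_2(\epsilon)\|_1.$$

The first inequality holds as the solution $Z_{ij}, w_i, v_j$ is a feasible solution of Problem (18), but may
not be optimal.

Using an analogous procedure starting at the optimal solution of $PV(\hat\mu_1(\nu), \hat\mu_2(\nu), \epsilon)$ and finding a
feasible solution for distributions $\mu_1(\nu)$, $\mu_2(\nu)$ we obtain

$$PV(\mu_1(\nu), \mu_2(\nu), \epsilon) - PV(\hat\mu_1(\nu), \hat\mu_2(\nu), \epsilon) \le \frac{1}{2}\|\mu_1(\epsilon) - \hat\mu_1(\epsilon)\|_1 + \frac{1}{2}\|\mu_2(\epsilon) - \hat\mu_2(\epsilon)\|_1.$$

Combining the last two inequalities concludes the proof of Lemma 15. □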

A.4 Proof of Theorem 4

We restate the theorem:

Theorem 4. Let $P = Q$ be the uniform distribution on $S^{d-1}$, a unit $(d - 1)$-dimensional
hypersphere. Let $S_1 = \{x_1, \dots, x_N\} \sim P$ and $S_2 = \{y_1, \dots, y_N\} \sim Q$ be two i.i.d. samples. For
any $\epsilon, \epsilon', \delta \in (0, 1)$, $0 \le \eta < 2/3$ and sample size $\frac{\log(1/\delta)}{2(1 - 3\eta/2)^2} \le N \le \frac{\eta}{2} e^{d(1 - \epsilon^2)/2}$, we have
$PV(P, Q, \epsilon') = 0$ and

$$\mathbb{P}\left(PV(S_1, S_2, \epsilon) > \eta\right) \ge 1 - \delta.$$

Proof. We use the following definitions and lemmas.
Definition 18. The spherical cap of radius $r$ about a point $x$ is

$$C(r, x) = \{z \in S^{d-1} : d(z, x) \le r\}.$$

Lemma 19. The spherical cap of radius r about a point x on a unit sphere is equal to



Lemma 20. Let $r = \sqrt{1 - \eta^2}$. For $0 \le \eta < 1$, the cap $C(r, x)$ on $S^{d-1}$ has a measure at most
$e^{-d\eta^2/2}$.

Let $p = \mathbb{P}(ng_{S_2}(x) = \emptyset)$ be the probability of an empty neighbor set. The next lemma bounds this
probability.

Lemma 21. The probability of an empty neighbor set satisfies $\mathbb{P}(ng_{S_2}(x) = \emptyset) \ge 1 - N e^{-d(1 - \epsilon^2)/2}$.
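Proof.

$$p = \mathbb{P}(ng_{S_2}(x) = \emptyset) = 1 - \mathbb{P}(ng_{S_2}(x) \ne \emptyset) = 1 - \mathbb{P}(\exists\, y_j \in S_2 : y_j \in C(\epsilon, x))$$
$$\ge 1 - N\,\mathbb{P}(y \in C(\epsilon, x)) \ge 1 - N e^{-d(1 - \epsilon^2)/2},$$

where the first inequality is due to the union bound, and the second is by Lemma 20. □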



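We consider the probability that the PV is greater than some $0 \le \eta < 1$. Note that since
$PV(P, Q) = 0$ this is also the difference between the empirical and distributional PV. Let
$E = \{x_i \in S_1 : ng_{S_2}(x_i) = \emptyset\}$ be the set of samples in $S_1$ without neighbors, and $N_E$ its
cardinality.

$$\mathbb{P}\left(PV(S_1, S_2, \epsilon) > \eta\right) \ge \mathbb{P}\left(\frac{N_E}{N} > \eta\right) = 1 - \mathbb{P}(N_E \le N\eta) \ge 1 - \mathbb{P}(N_E \le \lceil N\eta \rceil)$$
$$= 1 - \sum_{i=0}^{\lceil N\eta \rceil} \binom{N}{i} p^i (1 - p)^{N - i}. \quad (25)$$

The first inequality holds, as $PV(S_1, S_2, \epsilon) > \eta$ is obtained when more than $\eta N$ samples from $S_1$ have
no neighbors from $S_2$ in their $\epsilon$-neighborhood. Note that since $n = m$ there are also exactly $N_E$
samples from $S_2$ which are not matched.

By Chernoff's inequality we have that

$$\sum_{i=0}^{\lceil N\eta \rceil} \binom{N}{i} p^i (1 - p)^{N - i} \le \exp\left(-2N(p - \eta)^2\right). \quad (26)$$

Combining Equations (25) and (26) we get

$$\mathbb{P}\left(PV(S_1, S_2, \epsilon) > \eta\right) \ge 1 - \exp\left(-2N(p - \eta)^2\right). \quad (27)$$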
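The tail inequality used in (26) can be checked numerically against the exact binomial sum. The Python sketch below is only an illustration with hypothetical function names; it compares the lower tail of a Binomial$(N, p)$ variable with the Hoeffding-type bound $\exp(-2N(p-\eta)^2)$, which holds for $\eta < p$ when the tail is taken up to $\lfloor N\eta \rfloor$ (the floor is an assumption made here to stay within the range where the inequality is guaranteed).

```python
import math


def binomial_lower_tail(n, p, k):
    """P(X <= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def hoeffding_bound(n, p, eta):
    """exp(-2 n (p - eta)^2), an upper bound on P(X <= n * eta) when eta < p."""
    return math.exp(-2 * n * (p - eta) ** 2)
```

When $p$ is close to 1, as Lemma 21 guarantees in high dimension, the bound becomes very small even for moderate $N$, which is what drives the conclusion of the theorem.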



By Lemma 21 we have that $p \ge 1 - N e^{-d(1 - \epsilon^2)/2}$.
If $0 \le \eta < 2/3$ and $N e^{-d(1 - \epsilon^2)/2} \le \eta/2$, we have that

$$p - \eta \ge 1 - N e^{-d(1 - \epsilon^2)/2} - \eta \ge 1 - 3\eta/2 > 0.$$

Substituting the last inequality into (27):

$$\mathbb{P}\left(PV(S_1, S_2, \epsilon) > \eta\right) \ge 1 - \exp\left(-2N(1 - 3\eta/2)^2\right).$$

The theorem statement is obtained for any $N$, $d$ and $\eta$ for which $2N(1 - 3\eta/2)^2 \ge \log(1/\delta)$. □